PDF In-Depth

PDF-to-Word Conversion: Why it's so hard to do

About the Author
 
Richard Crocker

Richard Crocker is a director, product designer and marketer at Nitro PDF Software, and first became involved with PDF back in the days of version 1.1 and Acrobat Exchange 2.0. He helped launch the Planet PDF site in 1998.


 

 
 

Editor's Note: This article originally appeared on the PDF Blog and came into existence while Nitro was working on a new online service to convert PDF to Word (which has now launched and is free). Planet PDF is a division of Nitro PDF Software.

When a file is converted to PDF, it loses its meaning. On the surface all the information is there, and to your eyes it looks exactly the same, but underneath, all the method, structure and intelligence used when designing the original document has been lost.† This is the heart of the challenge in converting PDF files back to formats like DOC (Microsoft Word), RTF and HTML, and it is not dissimilar to the challenge faced when running OCR on paper-based documents.

Once you have your PDF file, the original layout and meaning formed from text-based building blocks -- including words, lines (and line breaks), paragraphs, columns, tables, headers/footers and outlines -- are long gone. The content of a PDF simply describes how and where on the page each object should be displayed.
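
To make the contrast concrete, here is a minimal Python sketch of the two representations. The `Paragraph` and `PlacedText` classes, the `flatten` function and the fixed character width are all hypothetical simplifications made for illustration. A real PDF stores drawing operators in a content stream rather than objects like these, but the effect is similar: the style name and the paragraph boundary do not survive the trip.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    """Word-processor view: one reflowable run of text plus a named style."""
    text: str
    style: str


@dataclass
class PlacedText:
    """PDF-like view: a piece of text and the position where it is drawn."""
    x: float
    y: float
    text: str


def flatten(par, left=72.0, top=720.0, max_chars=30,
            line_height=14.0, char_width=6.0):
    """Lay a paragraph out as positioned words.

    Assumes every character is char_width points wide (a hypothetical
    simplification). The style and the paragraph boundary are discarded.
    """
    placed = []
    col = 0  # current column, counted in characters
    row = 0  # current line index, counted from the top
    for word in par.text.split():
        if col > 0 and col + len(word) > max_chars:
            row += 1  # word does not fit: start a new line
            col = 0
        placed.append(
            PlacedText(left + col * char_width, top - row * line_height, word)
        )
        col += len(word) + 1  # +1 for the space after the word
    return placed


para = Paragraph("When a file is converted to PDF, it loses its meaning.",
                 style="Body Text")
for item in flatten(para):
    print(item)
```

Nothing in the flattened list says that these words once formed a single paragraph, or that they were styled as body text. A converter has to infer both from the positions alone.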

This is a far cry from where you would be if you went back to the original file in Microsoft Word, Open Office, Google Docs, Adobe InDesign or a similar application. These kinds of word processing and desktop publishing applications follow similar principles, which is why converting files between them (while certainly not perfect) is a much simpler process.

How files are normally designed and edited in word processing applications

Most word processing applications use the same sort of principles for formatting and giving meaning to content. For the sake of this article, I'll use Microsoft Word as the example. Here are a few of the main ones:

  • Paragraphs let you work with text that reflows across lines and can be quickly reformatted using styles to adjust spacing, indentation, size and more.
  • Columns let you build more complex page layouts and, in many cases, make content easier to follow and give it meaning through different grouping styles.
  • Tables let you lay out tabular information not suited to the more linear formatting offered by paragraphs and columns.
  • Headers & footers let you repeat content consistently across multiple pages.

PDF to Word is like the OCR process

If you're familiar with optical character recognition (OCR) and converting paper to electronic form, you might have already grasped some of the complexities we're dealing with. Apart from recognizing fonts and how they should be displayed on the page, which OCR also has to do, the challenges are much the same for both, because all meaning and structure are gone from the content.

The loss of the text stream

Take a look at the screenshot below. The first three lines of text show how the text is displayed on the page in a PDF. The second set shows how many separate objects that same text is broken into inside the PDF. For each small text object, the PDF includes co-ordinates that simply describe where it should be positioned on the page and how it should be displayed.

Text objects in PDF

The first challenge in exporting text back out of PDF files arises because the streams of text from the original word processor have been broken up into these seemingly random chunks. From here we must start to discern how each chunk relates to the content around it. This process begins by extracting all the text from the PDF.
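
As a rough illustration of what comes next, the Python sketch below regroups positioned fragments into lines of text. The `Fragment` type, the `rebuild_lines` function and the two tolerance values are assumptions made for this example, not the algorithm used by Nitro's converter. It also assumes each fragment already carries its width, which a real extractor would have to work out from font metrics.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float      # left edge, in points
    y: float      # baseline, in points (larger means higher on the page)
    width: float  # horizontal extent, in points
    text: str


def rebuild_lines(fragments, y_tolerance=2.0, space_gap=1.5):
    """Group fragments that share a baseline, then join them left to right."""
    lines = []  # each entry is [baseline_y, [fragments on that line]]
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append([frag.y, [frag]])

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        text = frags[0].text
        for prev, cur in zip(frags, frags[1:]):
            # A visible gap between two fragments is treated as a word space.
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > space_gap else "") + cur.text
        result.append(text)
    return result


# Fragments in arbitrary order, split mid-word as they often are in a PDF.
chunks = [
    Fragment(114.0, 700.0, 15.0, "ile"),
    Fragment(105.0, 686.0, 33.0, "verted"),
    Fragment(72.0, 700.0, 24.0, "When"),
    Fragment(72.0, 686.0, 33.0, "is con"),
    Fragment(99.0, 700.0, 15.0, "a f"),
]
print(rebuild_lines(chunks))  # ['When a file', 'is converted']
```

Even this simple case depends on two tuning thresholds, and it does nothing yet about paragraphs, columns, tables or headers and footers.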

† It is possible to create PDF files with embedded structure information; however, most PDF files don't have it.




