PDF In-Depth

PDF-to-Word Conversion: Why it's so hard to do

About the Author
 
Richard Crocker

Richard Crocker first became involved with PDF back in the days of version 1.1 and Acrobat Exchange 2.0.

When a file is converted to PDF, it loses its meaning. On the surface all the information is there, and to your eyes it looks exactly the same, but underneath, all the method, structure and intelligence used when designing the original document have been lost.† This is the heart of the challenge in converting PDF files back to formats like DOC (Microsoft Word), RTF and HTML, and it is not dissimilar to the challenge faced when OCRing paper-based documents.

Once you have your PDF file, the original layout and meaning formed from text-based building blocks -- including words, lines (and line breaks), paragraphs, columns, tables, headers/footers and outlines -- are long gone. The content of a PDF just describes how and where on the page each object should be displayed.

This is a far cry from where you would be if you went back to the original file in Microsoft Word, OpenOffice, Google Docs, Adobe InDesign or a similar application. These kinds of word processing and desktop publishing applications follow similar principles, which is why converting files between them (while certainly not perfect) is a much simpler process.

How files are normally designed and edited in word processing applications

Most word processing applications use the same sort of principles for formatting and giving meaning to content. For the sake of this article, I'll use Microsoft Word as the example. Here are a few of the main ones:

  • Paragraphs let you work with text that reflows across lines and can be quickly reformatted using styles to adjust spacing, indent, size and more.
  • Columns let you build more complex page layouts and, in many cases, make content easier to follow and give it meaning through different grouping styles.
  • Tables let you lay out tabular information not suited to the more linear formatting offered by paragraphs and columns.
  • Headers & footers let you repeat content consistently across multiple pages.
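
To make the first point concrete, here is a minimal Python sketch of what "text that reflows" means. The `Paragraph` class and `layout` function are hypothetical names invented for illustration; they are not how Word or any other application actually stores documents. The idea is only that the paragraph is kept as one text stream and the line breaks are worked out at display time.

```python
import textwrap
from dataclasses import dataclass


@dataclass
class Paragraph:
    """A hypothetical word-processor paragraph: one text stream plus a style name."""
    text: str
    style: str = "Body"


def layout(paragraph: Paragraph, width: int) -> list[str]:
    """Reflow the paragraph's text stream to fit a line width given in characters."""
    return textwrap.wrap(paragraph.text, width=width)


para = Paragraph("When a file is converted to PDF, it loses its meaning.")

# The same paragraph laid out at two widths: line breaks are computed, not stored.
narrow = layout(para, width=20)
wide = layout(para, width=40)
```

Because the line breaks are never stored, changing the width just produces a different layout of the same stream. A PDF keeps only one finished layout, which is why the stream has to be guessed at on the way back.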

PDF to Word is like the OCR process

If you're familiar with optical character recognition (OCR) and converting paper to electronic form, you might have already grasped some of the complexities we're dealing with. Apart from recognizing fonts and how they should be displayed on the page, the challenges are much the same in both cases, because all meaning and structure are gone from the contents.

The loss of the text stream

Take a look at the screenshot below. The first three lines of text show how the text is displayed on the page in a PDF. The second part shows how many separate objects that text is broken into inside the PDF. For each small text object, the PDF includes co-ordinates that simply describe where it should be positioned on the page and how it should be displayed.

Text objects in PDF
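
For readers who want to see this in text form, the sketch below walks through a hand-written, simplified content-stream fragment in Python. `BT`/`ET`, `Tf`, `Td` and `Tj` are real PDF text operators, but the sample stream is made up for illustration, and the `text_objects` helper is a hypothetical toy: it understands only `Td` and `Tj`, and ignores string escapes, `TJ` arrays, text matrices, fonts and compression that a real extractor would need to handle.

```python
import re

# A hand-written, simplified content-stream fragment (illustrative only).
CONTENT = """
BT
/F1 12 Tf
72 720 Td
(The qu) Tj
38 0 Td
(ick bro) Tj
40 0 Td
(wn fox) Tj
ET
"""

# Matches either "(text) Tj" (show text) or "tx ty Td" (move the text position).
TOKEN = re.compile(
    r"\((?P<text>[^)]*)\)\s*Tj"
    r"|(?P<tx>-?\d+(?:\.\d+)?)\s+(?P<ty>-?\d+(?:\.\d+)?)\s+Td"
)


def text_objects(stream):
    """Return (x, y, text) for each shown string, tracking Td offsets only."""
    x = y = 0.0
    found = []
    for match in TOKEN.finditer(stream):
        if match.group("text") is not None:
            found.append((x, y, match.group("text")))
        else:
            # Td offsets are relative to the start of the current text line.
            x += float(match.group("tx"))
            y += float(match.group("ty"))
    return found


objects = text_objects(CONTENT)
```

Here `objects` holds three separate entries, `"The qu"`, `"ick bro"` and `"wn fox"`, each with its own position. Nothing in the stream says that they form the words "The quick brown fox".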

The first challenge for exporting text back out of PDF files comes when the streams of text from the original word processor get broken up into these seemingly random chunks. From here we must start to discern what their relationship is to the content around them. This process begins by sucking out all the text from the PDF.
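
As a rough illustration of that first step, the Python sketch below takes positioned text fragments and regroups them into lines. It is not the method used by any particular converter. The `Fragment` class, the `rebuild_lines` function and both threshold values are assumptions made for this example, and it presumes that each fragment's width is already known (in practice that comes from font metrics).

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    x: float      # left edge of the fragment, in page units
    y: float      # baseline; in PDF, larger y is higher up the page
    width: float  # rendered width (assumed known here)
    text: str


def rebuild_lines(fragments, y_tolerance=2.0, gap_threshold=3.0):
    """Group fragments into lines by baseline, then join each line left to right."""
    lines = []  # each entry is (baseline_y, [fragments on that line])
    for frag in sorted(fragments, key=lambda f: -f.y):  # top of page first
        if lines and abs(lines[-1][0] - frag.y) <= y_tolerance:
            lines[-1][1].append(frag)
        else:
            lines.append((frag.y, [frag]))

    result = []
    for _, frags in lines:
        frags.sort(key=lambda f: f.x)
        text = frags[0].text
        for prev, cur in zip(frags, frags[1:]):
            gap = cur.x - (prev.x + prev.width)
            # A visible gap is treated as a word break; touching fragments are merged.
            text += (" " if gap > gap_threshold else "") + cur.text
        result.append(text)
    return result


fragments = [
    Fragment(150, 720, 38, "wn fox"),
    Fragment(72, 720, 38, "The qu"),
    Fragment(72, 706, 30, "jumps"),
    Fragment(110, 720, 40, "ick bro"),
    Fragment(108, 706.5, 25, "over"),
]
lines = rebuild_lines(fragments)
```

Even this toy version has to guess at two thresholds, and it knows nothing yet about paragraphs, columns or tables, which is where the harder part of the problem lies.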

† It is possible to create PDF files with embedded structure information in them; however, most PDF files don't have this structure.



