PDF In-Depth

PDF-to-Word Conversion: Why it's so hard to do

About the Author
 
Richard Crocker

Richard Crocker is a director, product designer and marketer at Nitro PDF Software, and first became involved with PDF back in the days of version 1.1 and Acrobat Exchange 2.0. He helped launch the Planet PDF site in 1998. Today you can...

Editor's Note: This article originally appeared on the PDF Blog and came into existence while Nitro was working on a new online service to convert PDF to Word (which has now launched and is free). Planet PDF is a division of Nitro PDF Software.

When a file is converted to PDF, it loses its meaning. On the surface all the information is there, and to your eyes it looks exactly the same, but underneath, all the method, structure and intelligence used when designing the original document have been lost.† This forms the heart of the challenge faced when attempting to convert PDF files back to formats like DOC (Microsoft Word), RTF and HTML, and it is not dissimilar to the challenges faced when OCRing paper-based documents.

Once you have your PDF file, the original layout and meaning formed from text-based building blocks -- including words, lines (and line breaks), paragraphs, columns, tables, headers/footers and outlines -- are long gone. The content of a PDF just describes how and where on the page each object should be displayed.

This is a far cry from where you would be if you went back to the original file in Microsoft Word, Open Office, Google Docs, Adobe InDesign, or whatever. These kinds of word processing and desktop publishing applications follow similar principles, which is why converting files between them (while certainly not perfect) is a much simpler process.

How files are normally designed and edited in word processing applications

Most word processing applications use the same sort of principles for formatting and giving meaning to content. For the sake of this article, I'll use Microsoft Word as the example. Here are a few of the main ones:

  • Paragraphs let you work with text that reflows across lines and can be quickly reformatted using styles to adjust spacing, indent, size and more.
  • Columns let you incorporate more complex page layouts and, in many cases, make content easier to follow by grouping it in different ways.
  • Tables let you lay out tabular information not suited to the more linear formatting offered by paragraphs and columns.
  • Headers & footers let you repeat content consistently across multiple pages.
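
To make the first of these concrete, here is a minimal Python sketch of why a paragraph is more than its visible lines. The `Paragraph` class and its fields are hypothetical, made up for illustration, and are not how Word actually stores documents; the only point is that line breaks are computed from the text and the layout settings rather than stored with the text.

```python
import textwrap
from dataclasses import dataclass
from typing import List


@dataclass
class Paragraph:
    """Hypothetical model of a word-processor paragraph (illustration only)."""
    text: str        # one continuous stream of text, with no line breaks
    style: str       # named style, e.g. "Body" or "Heading 1"
    indent: int = 0  # left indent, measured in characters for simplicity

    def lay_out(self, page_width: int) -> List[str]:
        """Compute the visible lines for a given page width (in characters)."""
        lines = textwrap.wrap(self.text, width=page_width - self.indent)
        return [" " * self.indent + line for line in lines]


para = Paragraph(
    text="When a file is converted to PDF, it loses its meaning.",
    style="Body",
)

# The same paragraph reflows when the layout changes; the text is untouched.
for width in (40, 24):
    print(f"--- width {width} ---")
    print("\n".join(para.lay_out(width)))
```

Because the text stream survives, changing the width or the indent simply produces a new set of lines. That is the information a PDF normally no longer carries.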

PDF to Word is like the OCR process

If you're familiar with optical character recognition (OCR) and converting paper to electronic form, you might have already grasped some of the complexities we're dealing with. Apart from recognizing fonts and how they should be displayed on the page, the challenges are much the same for both, as all meaning and structure are gone from the content.

The loss of the text stream

Take a look at the screenshot below. The first three lines of text show how the text is displayed on the page in a PDF. The second part shows how many separate objects the text is broken into inside the PDF. For each small text object, the PDF includes co-ordinates that simply describe where it should be positioned on the page and how it should be displayed.

[Screenshot: Text objects in PDF]

The first challenge for exporting text back out of PDF files comes when the streams of text from the original word processor get broken up into these seemingly random chunks. From here we must start to discern what their relationship is to the content around them. This process begins by sucking out all the text from the PDF.
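
A first pass at discerning those relationships can be sketched in Python. This is a simplified illustration, not the method used by Nitro or any particular converter: the `Fragment` type, the sample coordinates and the tolerance values are all assumptions, and real PDF files also involve rotation, mixed fonts, columns and more.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical text chunk pulled from a PDF page (illustration only)."""
    text: str
    x: float      # left edge, in points
    y: float      # baseline, in points from the bottom of the page
    width: float  # rendered width, in points


def group_into_lines(fragments: List[Fragment],
                     y_tolerance: float = 2.0,
                     gap_for_space: float = 1.5) -> List[str]:
    """Rebuild lines of text from fragments using only their positions."""
    lines: List[List[Fragment]] = []
    # Visit fragments top to bottom (PDF y grows upward), then left to right.
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if lines and abs(lines[-1][0].y - frag.y) <= y_tolerance:
            lines[-1].append(frag)  # roughly the same baseline: same line
        else:
            lines.append([frag])    # new baseline: start a new line

    result = []
    for line in lines:
        line.sort(key=lambda f: f.x)
        text = line[0].text
        for prev, cur in zip(line, line[1:]):
            # A visible horizontal gap between chunks is treated as a space.
            gap = cur.x - (prev.x + prev.width)
            text += (" " if gap > gap_for_space else "") + cur.text
        result.append(text)
    return result


# Made-up fragments, deliberately out of order and split mid-word.
fragments = [
    Fragment("Word", 84.0, 686.0, 26.0),
    Fragment("vert", 92.0, 700.2, 19.0),
    Fragment("files", 139.0, 700.0, 24.0),
    Fragment("Con", 72.0, 700.0, 20.0),
    Fragment("to", 72.0, 686.0, 9.0),
    Fragment("PDF", 114.0, 700.0, 22.0),
]

print(group_into_lines(fragments))  # ['Convert PDF files', 'to Word']
```

Note that this only recovers lines. Whether "to Word" continues the sentence above it, starts a new paragraph, or belongs to a neighbouring column is exactly the kind of relationship that still has to be inferred.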

† It is possible to create PDF files with embedded structure information in them; however, most PDF files don't have this structure.



