PDF In-Depth

PDF-to-Word Conversion: Why it's so hard to do

Rediscovering words one object at a time

To begin with, the PDF-to-Word converter must put the words back together. We can look at each text object, its properties, and its distance from the surrounding text objects, and start to see where words, and the whitespace between them, might exist.
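
As a rough illustration of that idea, here is a minimal Python sketch that joins the text objects on one line and inserts a space wherever the horizontal gap looks too large to be ordinary letter spacing. The `TextObject` fields, the `join_into_words` name and the `gap_ratio` threshold are all assumptions made for this example; a real converter would read positions and font metrics from the PDF itself and would tune the threshold per font.

```python
from dataclasses import dataclass


@dataclass
class TextObject:
    """Hypothetical stand-in for one text object on a PDF page."""
    text: str         # characters drawn by this object
    x: float          # left edge, in points
    width: float      # horizontal extent, in points
    font_size: float  # in points


def join_into_words(objects, gap_ratio=0.25):
    """Join text objects from a single line into one string.

    A space is inserted when the gap between two neighbouring objects
    exceeds gap_ratio * font size (an assumed, tunable heuristic).
    """
    pieces = []
    prev = None
    for obj in sorted(objects, key=lambda o: o.x):
        if prev is not None:
            gap = obj.x - (prev.x + prev.width)
            if gap > gap_ratio * max(prev.font_size, obj.font_size):
                pieces.append(" ")
        pieces.append(obj.text)
        prev = obj
    return "".join(pieces)


line = [
    TextObject("Hel", 72.0, 15.0, 10.0),
    TextObject("lo", 87.2, 8.0, 10.0),    # tiny gap: same word
    TextObject("wor", 99.0, 16.0, 10.0),  # 3.8 pt gap: new word
    TextObject("ld", 115.1, 9.0, 10.0),   # tiny gap: same word
]
print(join_into_words(line))  # Hello world
```

A fixed ratio like this is only a starting point: tightly kerned or letter-spaced text can push the gaps either side of any single threshold.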

Good conversion starts with the accurate detection of line ends

The key to recreating an editable Word file is accurately detecting where each line ends. In the example below, it's pretty easy for a reader to see where each line ends, where new paragraphs start, and how columns are placed alongside each other, but nothing inside the PDF records these facts.

In the example below, getting it wrong could easily merge one line of text in the left column with a completely unrelated line of text from the right column -- not a hard mistake to make when the decision is based only on the amount of whitespace between text objects.

[Figure: Line breaks in PDF]
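
To make the column problem concrete, the following Python sketch clusters text fragments by baseline and then splits each cluster wherever the horizontal gap is wide enough to be a column gutter. Everything here is an assumption for illustration: the `Fragment` type, the `group_into_lines` name, and the `y_tolerance` and `gutter` values. It is not how any particular converter works.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    """Hypothetical text fragment with its position on the page."""
    text: str
    x: float         # left edge, in points
    width: float     # horizontal extent, in points
    baseline: float  # y of the baseline (PDF-style, increasing upward)


def group_into_lines(fragments, y_tolerance=2.0, gutter=18.0):
    """Cluster fragments into lines, splitting at column-sized gaps."""
    # Step 1: fragments whose baselines nearly coincide share a row.
    rows = []
    for frag in sorted(fragments, key=lambda f: (-f.baseline, f.x)):
        if rows and abs(rows[-1][0].baseline - frag.baseline) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    # Step 2: within a row, a gap of at least `gutter` ends the line.
    lines = []
    for row in rows:
        row.sort(key=lambda f: f.x)
        current = [row[0]]
        for prev, frag in zip(row, row[1:]):
            if frag.x - (prev.x + prev.width) >= gutter:
                lines.append(current)
                current = [frag]
            else:
                current.append(frag)
        lines.append(current)
    return [" ".join(f.text for f in line) for line in lines]


page = [
    Fragment("Left column", 72, 60, 700),
    Fragment("line one", 135, 40, 700),
    Fragment("Right column", 320, 70, 700.5),
    Fragment("Left line two", 72, 80, 688),
    Fragment("Right line two", 320, 80, 688),
]
for text in group_into_lines(page):
    print(text)
```

With the gutter check removed, the first row would come out as "Left column line one Right column", which is exactly the merge described above. Note that the result still interleaves the two columns row by row; restoring column-wise reading order is a separate step this sketch leaves out.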

Next: detecting paragraphs accurately and recreating them as editable text

Detecting line ends correctly not only saves you from merging columns of content together, it also begins the even more important task of rebuilding the structure of the text content. Once you can see a series of horizontal lines, you can start deducing where a paragraph of reflowing text might need to be -- and once you have that, you can start re-creating a Microsoft Word file that is highly editable.
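
A simple way to picture that deduction is to treat consecutive lines as one paragraph while the distance between their baselines stays close to normal line spacing. The Python sketch below does only that; the `Line` type, the `group_into_paragraphs` name and the `max_leading` factor are assumptions for this example, and hyphenated line ends are not handled.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """Hypothetical detected line of text."""
    text: str
    baseline: float   # y of the baseline (PDF-style, increasing upward)
    font_size: float  # in points


def group_into_paragraphs(lines, max_leading=1.5):
    """Merge lines into paragraphs of reflowable text.

    A new paragraph starts when the baseline-to-baseline distance
    exceeds max_leading * font size (an assumed heuristic).
    """
    paragraphs = []
    prev = None
    for line in sorted(lines, key=lambda l: -l.baseline):
        same_paragraph = (
            prev is not None
            and prev.baseline - line.baseline <= max_leading * line.font_size
        )
        if same_paragraph:
            paragraphs[-1].append(line.text)
        else:
            paragraphs.append([line.text])
        prev = line
    return [" ".join(parts) for parts in paragraphs]


detected = [
    Line("Detecting line ends", 700, 10),
    Line("is only the first", 688, 10),
    Line("step.", 676, 10),
    Line("Paragraphs come", 652, 10),  # 24 pt gap: new paragraph
    Line("next.", 640, 10),
]
for paragraph in group_into_paragraphs(detected):
    print(paragraph)
```

Each returned string can then be written out as a single Word paragraph, so the text reflows when edited instead of staying locked to its original line breaks.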

Of course, it's never that simple (if it were, I wouldn't be writing about it!). Unfortunately, paragraph-based content can be presented in many different ways, which makes accurate detection and reproduction even more difficult. Examples include:

  • Different text styles and colors.
  • Drop caps at the beginning of chapters or sections.
  • Indents applied to multiple lines to indicate new (and different) paragraphs of content, such as quotes.
  • First-line indents to indicate the start (and end) of paragraphs.
  • Alignment (left, right, centered or justified) that either changes or stays the same across lines, which can indicate separate paragraphs and content blocks.
  • Changes in line spacing, which can even show that the content was originally part of a separate paragraph and should therefore be treated differently.
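
Alignment is one of the cues above that lends itself to a small worked example. The Python sketch below guesses the alignment of a block from the left and right edges of its lines relative to the column it sits in. The `classify_alignment` name, the `(left, right)` input shape and the `tol` tolerance are assumptions made for this illustration, and the sketch ignores first-line indents, which would need to be detected separately.

```python
def classify_alignment(lines, col_left, col_right, tol=2.0):
    """Guess the alignment of a block of lines.

    lines: list of (left_edge, right_edge) pairs, in points.
    col_left, col_right: edges of the column containing the block.
    Returns 'justified', 'left', 'right', 'centered' or 'unknown'.
    """
    if not lines:
        return "unknown"

    # The last line of a justified paragraph is usually short,
    # so it is excluded from the flush-right test for justification.
    body = lines[:-1] if len(lines) > 1 else lines

    flush_left = all(abs(l - col_left) <= tol for l, _ in lines)
    flush_right = all(abs(r - col_right) <= tol for _, r in lines)
    body_flush_right = all(abs(r - col_right) <= tol for _, r in body)
    centered = all(
        abs((l - col_left) - (col_right - r)) <= tol for l, r in lines
    )

    if flush_left and body_flush_right:
        return "justified"
    if flush_left:
        return "left"
    if flush_right:
        return "right"
    if centered:
        return "centered"
    return "unknown"


# Column spans x = 72 to x = 300 in these examples.
print(classify_alignment([(72, 300), (72, 300), (72, 180)], 72, 300))
print(classify_alignment([(72, 250), (72, 290), (72, 180)], 72, 300))
print(classify_alignment([(120, 300), (90, 300), (200, 300)], 72, 300))
print(classify_alignment([(100, 272), (120, 252), (150, 222)], 72, 300))
```

Comparing the result for neighbouring blocks is one way to spot the boundary between, say, a centered heading and the justified paragraph that follows it.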

[Figure: Paragraphs in PDF]

A few examples of the different kinds of paragraphs a document might contain.
