PDF In-Depth

PDF-to-Word Conversion: Why it's so hard to do

Rediscovering words one object at a time



To begin with, the PDF-to-Word converter must put the words back together. By looking at each text object, its properties, and its distance from the surrounding text objects, we can start to see where words, and the whitespace between them, might exist.
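As a rough illustration of that idea, here is a minimal Python sketch that joins text objects sitting on one baseline and inserts a space wherever the gap to the previous object is wide. The `TextObject` fields, the `rebuild_line` helper and the `gap_ratio` threshold are all assumptions made up for this example; a real converter would read positions and widths from the PDF's font metrics and would tune the threshold per font.

```python
from dataclasses import dataclass


@dataclass
class TextObject:
    text: str         # characters drawn by this object
    x: float          # left edge, in points
    width: float      # horizontal advance, in points
    font_size: float  # font size, in points


def rebuild_line(objects, gap_ratio=0.25):
    """Join text objects from one baseline, adding a space at each wide gap.

    gap_ratio is a hypothetical tuning knob: a gap wider than
    gap_ratio * font_size is treated as whitespace between words.
    """
    parts = []
    prev = None
    for obj in sorted(objects, key=lambda o: o.x):
        if prev is not None:
            gap = obj.x - (prev.x + prev.width)
            if gap > gap_ratio * prev.font_size:
                parts.append(" ")
        parts.append(obj.text)
        prev = obj
    return "".join(parts)


# Five fragments, as a PDF might store them for a single line of text.
fragments = [
    TextObject("Con", 72.0, 18.0, 10.0),
    TextObject("ver", 90.2, 14.0, 10.0),
    TextObject("sion", 104.3, 17.0, 10.0),
    TextObject("is", 125.0, 7.0, 10.0),
    TextObject("hard", 135.5, 19.0, 10.0),
]
print(rebuild_line(fragments))  # Conversion is hard
```

The small gaps inside "Conversion" come from kerning-sized offsets, while the gaps before "is" and "hard" exceed the threshold and become spaces. Choosing that threshold well is exactly where real documents get difficult.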

Good conversion starts with accurate detection of line ends

The key to recreating an editable Word file is accurately detecting where each line ends. If you take a look at the example below, it's pretty easy to see where each line ends, where new paragraphs start, and how the columns sit alongside each other, but inside the PDF there is nothing that records these facts.

Getting it wrong in the example below could easily result in one line of text in the left column merging with a completely unrelated line of text from the right column. That's not a hard mistake to make when the decision is based on the amount of whitespace between text objects.

Line breaks in PDF
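To make the column problem concrete, the following Python sketch groups text fragments by baseline and then cuts each row wherever the horizontal gap is wide enough to be a column gutter. The `Fragment` type, the `split_into_lines` helper and both thresholds (`baseline_tol`, `gutter`) are assumptions for illustration only; a fixed gutter width is a simplification, and a real converter would need to infer it from the page.

```python
from dataclasses import dataclass


@dataclass
class Fragment:
    text: str
    x: float      # left edge, in points
    y: float      # baseline, in points (PDF y grows upward)
    width: float  # horizontal extent, in points


def split_into_lines(fragments, baseline_tol=2.0, gutter=18.0):
    """Group fragments into rows by baseline, then split rows at wide gaps.

    baseline_tol and gutter are hypothetical thresholds: fragments whose
    baselines differ by at most baseline_tol share a row, and a horizontal
    gap wider than gutter is treated as a column boundary.
    """
    rows = []
    for frag in sorted(fragments, key=lambda f: (-f.y, f.x)):
        if rows and abs(rows[-1][0].y - frag.y) <= baseline_tol:
            rows[-1].append(frag)
        else:
            rows.append([frag])

    lines = []
    for row in rows:
        row.sort(key=lambda f: f.x)
        current = [row[0]]
        for prev, frag in zip(row, row[1:]):
            if frag.x - (prev.x + prev.width) > gutter:
                lines.append(current)  # gap is a gutter: end the line here
                current = [frag]
            else:
                current.append(frag)
        lines.append(current)
    return [" ".join(f.text for f in line) for line in lines]


page = [
    Fragment("The quick", 72, 700, 50),
    Fragment("brown fox", 126, 700, 48),
    Fragment("Stock prices", 320, 700, 60),
    Fragment("jumps over", 72, 688, 52),
    Fragment("fell sharply", 320, 688.5, 58),
]
for line in split_into_lines(page):
    print(line)
```

With these sample values the left-column and right-column text stay separate. Set `gutter` too high and "The quick brown fox" would merge with "Stock prices", which is the mistake described above. The sketch emits lines row by row, so putting them into per-column reading order would be a further step.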

Next: accurately detecting and recreating editable paragraphs

Detecting line ends correctly not only saves you from merging columns of content together, it also does the even more important job of starting to rebuild the structure of the text content. Once you can see a series of horizontal lines, you can start deducing where a paragraph with reflowing text might need to be, and once you have that, you can start re-creating a Microsoft Word file that is highly editable.
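One simple way to go from detected lines to candidate paragraphs is sketched below in Python: start a new paragraph when the vertical step between baselines is noticeably larger than the typical step, or when a line begins further right than the one before it. The `Line` type, the `group_paragraphs` helper and its two thresholds are assumptions for this example, not a description of how any particular converter works.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Line:
    text: str
    x: float  # left edge, in points
    y: float  # baseline, in points (PDF y grows upward)


def group_paragraphs(lines, spacing_factor=1.3, indent_tol=3.0):
    """Group top-to-bottom lines into paragraphs.

    spacing_factor and indent_tol are hypothetical thresholds: a baseline
    step larger than spacing_factor times the median step, or a left edge
    more than indent_tol to the right of the previous line, starts a new
    paragraph.
    """
    if not lines:
        return []
    steps = [a.y - b.y for a, b in zip(lines, lines[1:])]
    typical = median(steps) if steps else 0.0

    paragraphs = [[lines[0]]]
    for prev, line in zip(lines, lines[1:]):
        bigger_gap = (prev.y - line.y) > spacing_factor * typical
        indented = line.x > prev.x + indent_tol
        if bigger_gap or indented:
            paragraphs.append([line])
        else:
            paragraphs[-1].append(line)
    return [" ".join(l.text for l in p) for p in paragraphs]


column = [
    Line("Detecting line ends correctly", 90, 700),
    Line("is only the first step.", 72, 688),
    Line("Paragraphs come next, and", 90, 676),
    Line("they are harder to find.", 72, 664),
    Line("A gap can also mark a break.", 72, 640),
]
for paragraph in group_paragraphs(column):
    print(paragraph)
```

Here the third line starts a new paragraph because of its first-line indent, and the last line starts one because of the extra vertical space. The heuristic is deliberately naive: it cannot tell where a block-indented quote ends, for instance, because the return to the normal margin looks the same as the second line of an indented paragraph.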

Of course, it's never that simple (if it were, I wouldn't be writing about it!). Unfortunately, paragraph-based content can be presented in many different ways, which makes accurate detection and reproduction even more difficult. Examples include:

  • Different text styles and colors.
  • Drop caps at the beginning of chapters or sections.
  • Indents applied to multiple lines to indicate new (and different) paragraphs of content, such as quotes.
  • First-line indents to indicate the start (and end) of paragraphs.
  • Alignment (left, right, centered or justified) that changes, or stays the same, across lines, which can indicate separate paragraphs and content blocks.
  • Changes in line spacing, which can even show that the content was originally part of a separate paragraph and should therefore be treated differently.

Paragraphs in PDF

A few examples of the different kinds of paragraphs a document might contain.
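Alignment is one of the cues listed above that can be estimated from line geometry alone. The Python sketch below classifies a block of lines from their start and end positions; the `classify_alignment` helper, its `(x_start, x_end)` input format and the `tol` tolerance are assumptions for illustration, and it presumes the block's margins are already known.

```python
def classify_alignment(lines, left_margin, right_margin, tol=2.0):
    """Guess the alignment of a block of lines.

    lines is a list of (x_start, x_end) pairs, one per line, in points.
    tol is a hypothetical tolerance for "touches the margin".
    """
    left_all = all(abs(x0 - left_margin) <= tol for x0, _ in lines)
    right_all = all(abs(x1 - right_margin) <= tol for _, x1 in lines)

    # The last line of a justified paragraph is usually short, so it is
    # ignored when checking the right edge (unless it is the only line).
    body = lines[:-1] or lines
    right_body = all(abs(x1 - right_margin) <= tol for _, x1 in body)

    if left_all and right_body:
        return "justified"
    if left_all:
        return "left"
    if right_all:
        return "right"

    middle = (left_margin + right_margin) / 2
    if all(abs((x0 + x1) / 2 - middle) <= tol for x0, x1 in lines):
        return "centered"
    return "unknown"


print(classify_alignment([(72, 300), (72, 300.5), (72, 210)], 72, 300))
print(classify_alignment([(72, 280), (72, 295), (72, 210)], 72, 300))
print(classify_alignment([(120, 300), (95, 300), (200, 300)], 72, 300))
print(classify_alignment([(136, 236), (111, 261), (161, 211)], 72, 300))
```

A change in the result from one run of lines to the next is a hint that a new paragraph or content block has started. A first-line indent or a drop cap would push the sketch toward "unknown", which is one reason these cues have to be weighed together rather than applied one at a time.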
