PDF-to-Word Conversion: Why it's so hard to do

Laying out the page with columns



Just as we examine the relationships and patterns between lines of text to rediscover paragraphs, we can work out where columns might exist by looking at all the text and paragraphs on the page. For example, if we see a series of paragraphs whose left edges sit on the same vertical axis, and each paragraph uses similar text styles and spacing, we may well have found a column of content.
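To make the idea concrete, here is a minimal sketch of grouping paragraphs into candidate columns by their left edge and font size. The `Paragraph` type, the `find_columns` function and the tolerance values are all hypothetical and chosen for illustration; a real converter would weigh many more signals (spacing, right edges, reading order) before deciding.

```python
from dataclasses import dataclass


@dataclass
class Paragraph:
    left: float       # x-coordinate of the left edge, in points (assumed unit)
    top: float        # distance from the top of the page, in points
    font_size: float  # dominant font size of the paragraph, in points


def find_columns(paragraphs, x_tolerance=3.0, size_tolerance=0.5, min_paragraphs=2):
    """Group paragraphs that share a left edge and a similar font size.

    Returns a list of columns ordered left to right; each column is a list
    of paragraphs ordered top to bottom. Groups with fewer than
    `min_paragraphs` members are not treated as columns.
    """
    groups = []
    for p in sorted(paragraphs, key=lambda p: p.left):
        for g in groups:
            same_axis = abs(p.left - g["left"]) <= x_tolerance
            same_style = abs(p.font_size - g["font_size"]) <= size_tolerance
            if same_axis and same_style:
                g["members"].append(p)
                break
        else:
            # No existing group matched, so start a new candidate column.
            groups.append({"left": p.left, "font_size": p.font_size, "members": [p]})

    return [
        sorted(g["members"], key=lambda p: p.top)
        for g in groups
        if len(g["members"]) >= min_paragraphs
    ]
```

With paragraphs starting near x = 72 and x = 310 plus a lone heading elsewhere, this would report two columns and leave the heading out, which is roughly the behavior described above.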

For this to come together well, we need to take a holistic approach and not assume too much before looking at all the page content. Once you think you have the overall text layout for the page, you need to take all the necessary page measurements so you can use them in Microsoft Word.

Using the Column settings in Word, you need to specify the number of columns, their widths and the spacing between them, and then insert column and line breaks to ensure the text is placed in the right column and in the right area of the page.
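As a rough illustration of turning page measurements into column settings, the sketch below derives the column count, widths and gaps from the horizontal extents of the detected columns. It then formats them as a WordprocessingML `w:cols` fragment, where measurements are expressed in twentieths of a point. The function names and the input format are assumptions made for this example, and the fragment is only a string: placing it into a real document's section properties is left out.

```python
def points_to_twips(points):
    """Convert points to twentieths of a point, the unit used for column sizes."""
    return round(points * 20)


def column_settings(column_boxes):
    """Derive column count, widths and gaps from (left, right) extents in points."""
    boxes = sorted(column_boxes)
    widths = [right - left for left, right in boxes]
    # The gap after column i is the distance to the left edge of column i + 1.
    gaps = [boxes[i + 1][0] - boxes[i][1] for i in range(len(boxes) - 1)]
    return {"count": len(boxes), "widths": widths, "gaps": gaps}


def cols_xml(settings):
    """Format the settings as a w:cols fragment with explicit, unequal columns."""
    parts = []
    for i, width in enumerate(settings["widths"]):
        attrs = f'w:w="{points_to_twips(width)}"'
        if i < len(settings["gaps"]):
            # Every column except the last carries the spacing that follows it.
            attrs += f' w:space="{points_to_twips(settings["gaps"][i])}"'
        parts.append(f"<w:col {attrs}/>")
    return (
        f'<w:cols w:num="{settings["count"]}" w:equalWidth="0">'
        + "".join(parts)
        + "</w:cols>"
    )


# Two 3-inch columns separated by a half-inch gap.
settings = column_settings([(72, 288), (324, 540)])
print(cols_xml(settings))
```

Keeping the measuring step separate from the formatting step means the same numbers can also drive the page margins or a sanity check on the total width.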

True editability comes with advanced table detection

Tabulated content (i.e. tables) is similar to columns, only more complex. You're dealing with both columns and rows, and the cues available for discerning a table accurately vary widely. Quality table detection borders on a black art because each table you encounter is different, forcing you to run through a large range of processes before working out whether the content is a table or not.

If you look at the table below you can see how cells and tables can be formatted in different ways. There are obvious signposts, such as cell background coloring and borders, to show it's a table, but toward the bottom of the table those indicators are gone.

Table text in PDF files

To get to that level of detection you have to run many processes to find a pattern and accurately identify the presence of a table.
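One such process, for tables with no borders or shading, might look for consecutive rows of text whose cells line up on the same x-positions. The sketch below is a simplified illustration of that single heuristic, not a complete detector. The fragment format `(x, y, text)`, the function names and the tolerances are assumptions, and y is assumed to be measured from the top of the page.

```python
def group_rows(fragments, y_tolerance=2.0):
    """Group (x, y, text) fragments into rows, top to bottom, each sorted left to right."""
    rows = []
    for frag in sorted(fragments, key=lambda f: (f[1], f[0])):
        if rows and abs(frag[1] - rows[-1][0][1]) <= y_tolerance:
            rows[-1].append(frag)
        else:
            rows.append([frag])
    return [sorted(row, key=lambda f: f[0]) for row in rows]


def rows_align(row_a, row_b, x_tolerance=3.0):
    """True if both rows have the same number of cells (at least 2) at matching x-positions."""
    return len(row_a) == len(row_b) >= 2 and all(
        abs(a[0] - b[0]) <= x_tolerance for a, b in zip(row_a, row_b)
    )


def find_table_candidates(fragments, min_rows=3):
    """Return runs of at least `min_rows` consecutive rows whose cells align."""
    candidates, run = [], []
    for row in group_rows(fragments):
        if run and rows_align(run[-1], row):
            run.append(row)
        else:
            if len(run) >= min_rows:
                candidates.append(run)
            run = [row]
    if len(run) >= min_rows:
        candidates.append(run)
    return candidates
```

A heuristic like this misses tables with merged or empty cells, which is one reason several complementary checks are usually needed.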

Separating section/chapter-level content from page-level content

When determining the correct page margins to use when converting the PDF to Word, header and footer content often gets in the way and causes layout and editability problems. If you can detect this content and keep it separate, the normal page content is much more likely to lay out well.

You do this by scanning across multiple pages for similar content positioned in the same places at the top and/or bottom of the page and, when you find it, keeping it separate from the normal content. This advanced technique keeps the pages cleaner and allows you to incorporate this content into the actual header and footer areas of a Microsoft Word document.
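The multi-page scan can be sketched as follows. This is a minimal illustration under several assumptions: each page is a list of `(y, text)` lines with y measured from the top, the header and footer zones are fixed one-inch bands, and digits are masked so that running page numbers compare as equal. The function names and thresholds are hypothetical, and for simplicity the sketch matches on zone and text rather than on exact position.

```python
import re
from collections import Counter


def normalize(text):
    """Mask digit runs so that 'Page 3' and 'Page 4' compare as equal."""
    return re.sub(r"\d+", "#", text.strip().lower())


def zone_of(y, page_height, band):
    """Classify a line as header, footer, or neither (None) by vertical position."""
    if y <= band:
        return "header"
    if y >= page_height - band:
        return "footer"
    return None


def find_repeating_lines(pages, page_height, band=72.0, min_share=0.6):
    """Return (zone, normalized text) keys that recur on enough pages."""
    counts = Counter()
    for lines in pages:
        seen = set()  # count each key at most once per page
        for y, text in lines:
            zone = zone_of(y, page_height, band)
            if zone is not None:
                seen.add((zone, normalize(text)))
        counts.update(seen)
    threshold = max(2, min_share * len(pages))
    return {key for key, n in counts.items() if n >= threshold}


def split_page(lines, repeating, page_height, band=72.0):
    """Separate one page's lines into header, footer and body text."""
    parts = {"header": [], "footer": [], "body": []}
    for y, text in lines:
        zone = zone_of(y, page_height, band)
        is_repeating = (zone, normalize(text)) in repeating
        parts[zone if is_repeating else "body"].append(text)
    return parts
```

Text that sits in the header or footer band but appears on only one page, such as a title, stays with the body, so only genuinely repeating content is moved out.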

