August 17, 2012



Editor's Note: Leon is a developer at IDRsolutions, creators of a 100% Java PDF Library and PDF to HTML5 Converter. He blogs at Java PDF Blog.

Microsoft recently announced Microsoft Office 2013 -- their latest update to Microsoft Office. One of the things that interested me in their announcement was the inclusion of a "PDF Reflow" feature for Microsoft Word. According to Microsoft, you will be able to "open a PDF in Word, and its paragraphs, lists, tables, and other content" will "act just like Word content".

The PDF file format is notoriously difficult to convert into other formats -- it lets you do things that just don't translate well into other technologies. I spend a lot of my day working out how to emulate some of the features of PDF as HTML5 while developing a PDF to HTML5 Converter in Java for IDRsolutions. Limitless accuracy, decimal font sizes, individual glyph spacing and non-RGB images are just a handful of the things that are difficult to convert accurately, or that can only be converted with unwanted side effects (for example, a huge file size).

And that's just for rendering PDF. Editing the content of a PDF is a whole different ball game, and brings with it a whole new set of problems. Most importantly, if you want to make a PDF "reflow", how do you deal with text that has no structure? A PDF is not built the way you might expect -- it does not contain marked-up text -- it contains instructions for where to draw text. And the order in which the text is drawn does not necessarily match the order in which the text was written.
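
To make that concrete, here is a minimal Python sketch. It is not how any real PDF library represents text: the `TextRun` class, the coordinates and the `reading_order` helper are all made up for illustration. It shows a handful of positioned text runs in the order a file might draw them, and a naive attempt to recover the reading order from position alone.

```python
from dataclasses import dataclass


@dataclass
class TextRun:
    x: float   # left edge of the run, in points
    y: float   # baseline of the run, in points (PDF y grows upwards)
    text: str


# Hypothetical runs, listed in the order the file happens to draw them.
draw_order = [
    TextRun(130.5, 700.0, "brown"),
    TextRun(72.0, 686.0, "jumps"),
    TextRun(72.0, 700.0, "The"),
    TextRun(108.2, 686.0, "over"),
    TextRun(95.3, 700.0, "quick"),
    TextRun(166.0, 700.0, "fox"),
]


def reading_order(runs):
    """Guess reading order: highest baseline first, then left to right."""
    return sorted(runs, key=lambda run: (-run.y, run.x))


as_drawn = " ".join(run.text for run in draw_order)
as_read = " ".join(run.text for run in reading_order(draw_order))
print(as_drawn)
print(as_read)
```

The sort only works here because the example is a single column and every run on a line shares exactly the same baseline. Real files offer no such guarantees, which is where the trouble starts.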

Let's use a newspaper story to illustrate the problem. The text in the columns is usually justified, so the start of the text touches the start of the column, and the end of the text touches the end of the column. This is done by adjusting the spacing between individual words or individual characters. What if there are two columns? Should the text flow down a column, or across the page? As humans, we can very easily detect the pattern and understand the flow, but your average PDF file simply does not contain enough information for a computer to accurately determine the flow of text in whatever file it may encounter.
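
The two-column question can be sketched in a few lines of Python. Again, this is only an illustration under assumptions of my own: the line positions, the `across_page` and `down_columns` helpers and the `gutter_x` value are hypothetical, and nothing in a typical PDF tells you where the gutter is.

```python
# Hypothetical lines from a two-column page: (x, y, text), y grows upwards.
lines = [
    (72.0, 700.0, "L1"),
    (320.0, 700.0, "R1"),
    (72.0, 686.0, "L2"),
    (320.0, 686.0, "R2"),
]


def across_page(lines):
    """Read row by row: top to bottom, then left to right."""
    ordered = sorted(lines, key=lambda line: (-line[1], line[0]))
    return [text for _, _, text in ordered]


def down_columns(lines, gutter_x):
    """Read column by column, assuming the gutter position is known."""
    def key(line):
        x, y, _ = line
        column = 0 if x < gutter_x else 1
        return (column, -y, x)

    return [text for _, _, text in sorted(lines, key=key)]


print(across_page(lines))          # interleaves the two columns
print(down_columns(lines, 300.0))  # left column first, then right column
```

Both orderings are equally valid as far as the coordinates are concerned. Choosing `down_columns` means guessing that a gutter exists and where it sits, and a wrong guess silently scrambles the story.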

Knowing this, I was excited to see how well Microsoft have done. I quickly downloaded their free preview to try out!

Naturally, the first PDF that I chose to open was one of the nastiest (but perfectly valid) files that I knew we had. The file spent some time converting, and when it appeared, I wasn't disappointed. The first few pages appeared to display correctly, and then it went downhill. An extra page had been added to hold text that overflowed from the page before it, a page had been output as an image with uneditable text (we output it as text), two pages seemed to have been combined, and one page had an inch of the background from another page at the bottom. When I scrolled down far enough, Word even froze on me and had to be forcibly quit.

That is from simply viewing the file, making no changes. At the very least I would expect them to be able to display the PDF file, and for a page to be a page. Fundamentally, this needs to be correct before you should think about being able to make changes to it and to "reflow" it. So how are the reflow features? Well, I wasn't wrong with what I said earlier about the text not specifically having a flow. While Microsoft have made a good attempt, some lines are actually meant to be two columns, most lines don't link with the line below, and generally you can edit words, but if you want to add a sentence into a news story, forget about it.

I have experimented with quite a few PDF files now, and the result is much the same as my experience above. In my opinion Microsoft have not really thought this through. I'm sure that there will be improvements in the final version, but editing PDF files in this way is a flawed concept. For an average home-made PDF file that was generated with Microsoft Word, this works quite well (though such files still do not always display exactly as intended), but for anything more, the PDF file format simply doesn't lend itself to being played with like this.

