PDF In-Depth

PDF's Brittleness: A Lament

PDF - An optimal solution

February 10, 2001


PDF is one of the most brittle and unforgiving formats I've ever worked with. If you introduce even one stray byte into the middle of a file, you stand a 99% chance of wrecking the reading frame for the whole file (kind of like a frameshift error in DNA giving rise to "nonsense proteins") and totally breaking the world.

I've worked with brittle file formats over the years, especially in the area of data compression, and I can tell you that there are a number of interesting approaches to the problem of "brittleness" in files that are sensitive to "spot mutations." Some of the work on this goes back to the 1940s. I'm talking about error detection and recovery schemes involving checksum algorithms; the stuff of Xmodem/Ymodem/Kermit/etc. Plus much more subtle kinds of annealing. Many of these schemes involve adding redundancy to a file and so don't work to the advantage of a COMPRESSION scheme (duh!!!), but the point is, you can add robustness back to a brittle file in almost any dosage you want.

Is PDF a destructive format...?

My observation with regard to PDF is this. PDF is a brittle format. You look at it sideways and it breaks. Mainly I'm talking about the absolute need to reconcile every object offset to an 'xref' table. Anybody who has tried to hand-edit a PDF file knows what I am talking about. If you skew an offset, you screw an offset. The reason for having this built-in brittleness is, ostensibly, performance. Table lookups are faster than walking a linked list. With a huge document, all search, navigation, update, and display performance characteristics depend on the speed of direct table lookups.
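To make the offset problem concrete, here is a minimal sketch in Python. It is not a real PDF parser: the two-object "file", the helper names build and offsets_valid, and the choice of where to inject the stray space are all assumptions made for illustration. It records where each object starts, the way an 'xref' table does, and then checks whether those recorded offsets still land on an object header after one harmless-looking byte is inserted.

```python
def build(bodies):
    """Concatenate object bodies and record each object's byte offset."""
    out = b"%PDF-1.4\n"
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))          # where "N 0 obj" begins
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"
    return out, offsets


def offsets_valid(data, offsets):
    """True if every recorded offset still lands exactly on 'N 0 obj'."""
    return all(
        data.startswith(b"%d 0 obj" % num, off)
        for num, off in enumerate(offsets, start=1)
    )


data, offsets = build([b"<< /Type /Catalog >>", b"<< /Type /Pages >>"])

# Insert one extra space inside the dictionary of object 1. The dictionary
# itself is still fine; only the position of everything after it changes.
damaged = data[:20] + b" " + data[20:]

print(offsets_valid(data, offsets))      # offsets match the untouched bytes
print(offsets_valid(damaged, offsets))   # object 2 has moved by one byte
```

In this toy file the first check passes and the second fails: object 1 is still where the table says it is, but every object after the stray space is off by one.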

But we pay a terrible price for this performance, it seems to me. PDF files are, in fact, too easily breakable. It's a curious situation. I've never seen a file format this brittle that didn't depend, somewhere, on cyclic redundancy checks (CRCs) for a check of file integrity. That is, before you do ANYTHING with the file, the first thing you do upon opening it is run a CRC calculation (which takes very little time if you do it right), and if the CRC check flunks, you pack it up and tell the user to go home right then and there; you don't bother trying to do anything with the file, because you know it's corrupt. (Well, you 'know' with a high degree of probability that it is corrupt.)

CRCs are a very strict check of file integrity, because one flipped bit in a 100-megabyte file will make the CRC show up bad. I mean, we're talking about a very sensitive integrity sniffer here!
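As a rough illustration of that sensitivity, the sketch below uses CRC-32 from Python's standard zlib module. The one-megabyte buffer, the helper name crc_ok, and the idea of keeping the expected CRC in a separate variable are assumptions for the example; PDF itself defines no such whole-file field.

```python
import zlib


def crc_ok(data: bytes, expected: int) -> bool:
    """Open-time integrity check: recompute the CRC and compare."""
    return zlib.crc32(data) == expected


# Stand-in for a file: a header followed by a megabyte of zero bytes.
original = b"%PDF-1.4\n" + bytes(1_000_000)
stored_crc = zlib.crc32(original)     # computed once, when the file is written

corrupted = bytearray(original)
corrupted[500_000] ^= 0x01            # flip a single bit in the middle

print(crc_ok(original, stored_crc))          # intact file passes
print(crc_ok(bytes(corrupted), stored_crc))  # one flipped bit is caught
```

CRC-32 catches every single-bit error, so the second check fails no matter which bit is flipped.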

PDF perhaps doesn't need that degree of integrity assurance, but by the same token, it doesn't need to break down completely just because I introduced a stray whitespace character somewhere in the middle of an otherwise perfectly good file. That's the kind of designed-in lack of robustness that bothers me. It's the kind of straitjacket no file format needs, frankly.

Some suggestions and solutions

My solution would be this. No frameshift errors should ever break a PDF file. Ever. What this means is that no PDF file should carry its own hard-coded 'xref' table. The reading application should produce it dynamically, on the fly, at file-open time. At most, the PDF file should store a table of in-use versus defunct objects, so that the reading app can know which objects are usable (includable) for the 'xref' table. But as far as calculating object offsets for every object... that's something that can and should be done at runtime, by the consuming app. Once only, at file-open. Remember, it only has to be done once. After the speed hit of that initial table-tally, you're home free. (Or, you're as free as you were before.)
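A minimal sketch of what that file-open pass might look like, in Python. The function name build_xref and the sample bytes are hypothetical, and the scan is deliberately naive: it assumes every object header sits at the start of a line ending in a newline, and it makes no attempt to skip stream data that happens to contain something resembling a header. A real reader would need to handle both.

```python
import re

# "N G obj" at the start of a line, e.g. b"12 0 obj".
OBJ_HEADER = re.compile(rb"(?m)^(\d+)\s+(\d+)\s+obj\b")


def build_xref(data: bytes) -> dict:
    """Scan the whole file once and map (object, generation) -> byte offset."""
    xref = {}
    for m in OBJ_HEADER.finditer(data):
        key = (int(m.group(1)), int(m.group(2)))
        xref[key] = m.start()   # a later definition replaces an earlier one
    return xref


sample = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)

# The same stray space that would invalidate a stored table.
damaged = sample[:20] + b" " + sample[20:]

xref = build_xref(damaged)
print(xref)
```

Because the offsets are measured from the bytes actually present, the extra space simply shows up as a different offset for object 2 instead of a broken file.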

Just-in-time 'xref' compilation can only bring more flexibility and robustness to the format, it seems to me. It would certainly go a long way toward encouraging people to experiment with the format. Not only would shameless hackers like myself be more likely to spend more time hacking around inside files, but people who write "producing" apps would be more likely, I think, to produce PDF as an output format. Think how much easier it would be to produce dynamic PDF on a server via Perl if you didn't have to fuss with 'xref' tables, for instance.
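For a sense of the fuss involved today, here is a rough sketch (in Python, not Perl) of the bookkeeping a producing app has to do. The function name write_pdf and the two sample objects are assumptions for illustration, and the output is a bare skeleton with no pages, so it is not claimed to open cleanly in any particular viewer. The point is the offset tracking: every object's position must be recorded as it is written, formatted as a fixed-width 20-byte table entry, and the position of the table itself must be written at the very end.

```python
def write_pdf(bodies):
    """Serialize object bodies plus the classic xref table and trailer."""
    out = bytearray(b"%PDF-1.4\n")
    offsets = []
    for num, body in enumerate(bodies, start=1):
        offsets.append(len(out))                 # remember where object N starts
        out += b"%d 0 obj\n" % num + body + b"\nendobj\n"

    xref_pos = len(out)                          # where the table itself starts
    size = len(bodies) + 1                       # entry 0 is the free-list head
    out += b"xref\n0 %d\n" % size
    out += b"0000000000 65535 f \n"
    for off in offsets:
        out += b"%010d 00000 n \n" % off         # 10-digit offset, 20 bytes per entry
    out += b"trailer\n<< /Size %d /Root 1 0 R >>\n" % size
    out += b"startxref\n%d\n%%%%EOF\n" % xref_pos
    return bytes(out)


pdf = write_pdf([
    b"<< /Type /Catalog /Pages 2 0 R >>",
    b"<< /Type /Pages /Kids [] /Count 0 >>",
])
print(pdf.decode("ascii"))
```

If any template change, late edit, or string substitution alters the length of an object after its offset has been recorded, the table is wrong and the whole calculation has to be redone.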
