PDF In-Depth

Navigating the Internal Structure of a PDF Document

About the Author
 

Thom Parker

WindJack Solutions founder Thom Parker has been developing solutions for Adobe software since 1997. WindJack Solutions' focus is PDF development and Thom is actively involved in all things PDF. He developed two innovative tools for PDF developers and users, PDF CanOpener and...


 

 
 

A PDF Document is a many-layered thing; that is, it has many layers of abstraction. You can look at a PDF from different perspectives, each with its own advantages and disadvantages. At the lowest level, the PDF File contains the raw document data. Next up, the COS Layer organizes this data into a tree of simple objects. At the PD Layer, these simple objects are put together to implement useful intermediate-level structures like Fonts and Images. These are in turn organized into higher-level constructs like Annotations and Pages. Some of these objects are also used to impose logical structure, like paragraphs and article threads. And there are more layers still.

Each of these layers of abstraction has its own independent set of rules. For example, a file can be perfectly legal at the file format level and still not contain any useful objects. Likewise, the COS Object Tree may contain many objects that do not contribute to the document display, or that are completely unintelligible to Acrobat, and still be a legal object tree.

Knowing how to navigate these structures is essential to any PDF-related development effort. But what does a real document look like at the COS Object level? What are these objects, and what is really necessary to make a PDF Document? In the following text the structure of a real PDF Document is laid bare.

PDF File Structure:

Don't let this next section discourage you. It's an introduction to the file format. The infinitely more understandable PDF object structure follows.

The PDF File Format is text with some binary data mixed in. If you open a PDF in a text editor you'll see the raw objects that define the structure and content of the document. Explicit object definitions are prefixed with text that looks like '12 0 obj'. The number 12 is the object number, which serves as the object reference, and the 0 is the generation number. An object defined this way is called indirect, since it can be referenced by its number. You will also see objects without this prefix. These are called direct objects and are always contained inside other objects. A container object references an indirect object with the syntax '12 0 R', which here includes the object defined with '12 0 obj'. There are only 8 low-level, or COS, object types to learn here. (The PDF specification also defines a null object, written 'null', which is not covered in this list.)

The first 5 are scalar (single value) types:
  1. Integer - in the file as a number without a decimal point.
  2. Boolean - in the file as the text 'true' or 'false'.
  3. Real Number - in the file as a number with a decimal point.
  4. Name - in the file as '/text' i.e. a forward slash, '/', followed by some text, no white space or delimiter characters (such as parentheses, angle brackets, square brackets or another slash) allowed.
  5. String - in the file as either '(...characters...)' or '<...hexadecimal character codes...>' .
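The scalar forms listed above can be told apart by shape alone. The following is a minimal Python sketch, assuming the token has already been cut out of the file; the name classify_scalar and the patterns are illustrative only and follow the informal descriptions in this list, not the complete PDF grammar (for instance, strings with nested parentheses and names with escaped characters are not handled).

```python
import re

# Patterns mirror the informal descriptions in the list above.
# They are a simplification, not the full PDF lexical rules.
SCALAR_PATTERNS = [
    ("Boolean", re.compile(r"true|false")),
    ("Integer", re.compile(r"[+-]?\d+")),
    ("Real Number", re.compile(r"[+-]?(\d+\.\d*|\.\d+)")),
    # A slash followed by text with no white space or delimiter characters.
    ("Name", re.compile(r"/[^\s()<>\[\]{}/%]*")),
    # Literal string in parentheses (backslash escapes allowed, no nesting).
    ("String", re.compile(r"\((?:[^()\\]|\\.)*\)", re.S)),
    # Hexadecimal string in angle brackets.
    ("String", re.compile(r"<[0-9A-Fa-f\s]*>")),
]


def classify_scalar(token):
    """Return the scalar type name for a single token, or None if no pattern fits."""
    for kind, pattern in SCALAR_PATTERNS:
        if pattern.fullmatch(token):
            return kind
    return None
```

For example, classify_scalar("12") gives "Integer" and classify_scalar("/Type") gives "Name", while a token such as "<<" matches none of the scalar patterns because it opens a Dictionary.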
The next 3 are container types:
  1. Dictionary - in the file as '<<...other objects...>>'. Dictionary entries are always in pairs, a Name Object followed by any other object type.
  2. Array - in the file as '[...other objects...]'. A list of un-delimited objects separated by white space only where necessary.
  3. Stream - in the file as '20 0 obj<<...stream attribute objs...>>stream...binary data...endstream'. This is the most complex type. It's actually a Dictionary Object mated with a string of bytes. The Dictionary contains information necessary for accessing the data in the string of bytes. Streams are always indirect objects, so they always begin with an object number prefix such as '20 0 obj'.
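To make the indirect object and reference syntax concrete, here is a rough Python sketch that scans raw bytes for 'N G obj' definitions, 'N G R' references and stream data. The SAMPLE fragment is hand-written for illustration and is not a complete PDF file, and the function names are hypothetical. This naive scan also assumes that binary stream data never happens to contain the text 'obj' or 'endobj'; a real reader would locate objects through the file's cross-reference information and use the stream Dictionary to find the data length.

```python
import re

# A tiny hand-written fragment in PDF syntax (illustrative, not a complete file).
SAMPLE = (
    b"12 0 obj\n<< /Type /Page /Contents 20 0 R >>\nendobj\n"
    b"20 0 obj\n<< /Length 5 >>\nstream\nHello\nendstream\nendobj\n"
)

OBJ_DEF = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")   # e.g. '12 0 obj'
OBJ_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")     # e.g. '12 0 R'
STREAM = re.compile(rb"stream\r?\n(.*?)\r?\nendstream", re.S)


def indirect_objects(data):
    """Map (object number, generation number) to the raw bytes of each object."""
    objects = {}
    for match in OBJ_DEF.finditer(data):
        end = data.find(b"endobj", match.end())
        if end == -1:
            continue  # no terminator found; skip this definition
        key = (int(match.group(1)), int(match.group(2)))
        objects[key] = data[match.end():end].strip()
    return objects


def references(body):
    """List the (object number, generation number) pairs referenced in a body."""
    return [(int(num), int(gen)) for num, gen in OBJ_REF.findall(body)]


def stream_data(body):
    """Return the bytes between 'stream' and 'endstream', or None for non-streams."""
    match = STREAM.search(body)
    return match.group(1) if match else None
```

With the sample above, indirect_objects(SAMPLE) finds objects 12 and 20, references() on the body of object 12 reports the reference to object 20, and stream_data() on the body of object 20 returns its five bytes of data.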


