PDF In-Depth

Navigating the Internal Structure of a PDF Document

About the Author
 

Thom Parker

WindJack Solutions founder Thom Parker has been developing solutions for Adobe software since 1997. WindJack Solutions' focus is PDF development and Thom is actively involved in all things PDF. He developed two innovative tools for PDF developers and users, PDF CanOpener and...


 

 
 

A PDF Document is a many-layered thing; that is, it has many layers of abstraction. You can look at a PDF from different perspectives, each with its own advantages and disadvantages. At the lowest level, the PDF File contains the raw document data. Next up, the COS Layer organizes this data into a tree of simple objects. At the PD layer, these simple objects are put together to implement useful intermediate-level structures like Fonts and Images. These are in turn organized into higher-level constructs like Annotations and Pages. Some of these objects are also used to impose logical structure, like paragraphs and article threads. And there are more layers still.

Each of these layers of abstraction has its own independent set of rules. For example, a file that is perfectly legal at the file format level may not contain any useful objects. Likewise, the COS Object Tree may contain many objects that do not contribute to the document display, or that are completely unintelligible to Acrobat, and still form a legal object tree.

Knowing how to navigate these structures is essential to any PDF-related development effort. But what does a real document look like on the COS Object level? What are these objects, and what is really necessary to make a PDF Document? In the following text the structure of a real PDF Document is laid bare.

PDF File Structure:

Don't let this next section discourage you. It's an introduction to the file format. The infinitely more understandable PDF object structure follows.

The PDF File Format is text with some binary data mixed in. If you open a PDF in a text editor you'll see the raw objects that define the structure and content of the document. Explicit object definitions are prefixed with some text that looks like this: '12 0 obj'. Here 12 is the object number and 0 is the generation number; together they identify the object. An object defined this way is called indirect, since it can be referenced by its number. You will also see objects without this prefix. These are called direct objects, and they are always contained inside other objects. A container object refers to an indirect object with the syntax '12 0 R', which points at the object defined with '12 0 obj'. There are only 8 low level, or COS, object types, not counting the special null object.
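To make the '12 0 obj' and '12 0 R' syntax concrete, here is a minimal Python sketch that scans a small hand-written fragment for indirect object definitions and references. The fragment and the function names are made up for illustration. A plain regular expression scan like this is only an approximation: in a real file the same byte patterns can turn up inside binary stream data, so it is not a substitute for a proper parser.

```python
import re

# Hypothetical hand-written fragment, not taken from a real PDF file.
SAMPLE = b"""12 0 obj
<< /Type /Page /Parent 3 0 R /Contents 20 0 R >>
endobj
20 0 obj
<< /Length 5 >>
stream
hello
endstream
endobj
"""

# "<object number> <generation number> obj" starts an indirect object.
OBJ_DEF = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# "<object number> <generation number> R" refers to an indirect object.
OBJ_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def indirect_definitions(data):
    """Return (object number, generation number) for each 'N G obj'."""
    return [(int(n), int(g)) for n, g in OBJ_DEF.findall(data)]


def indirect_references(data):
    """Return (object number, generation number) for each 'N G R'."""
    return [(int(n), int(g)) for n, g in OBJ_REF.findall(data)]


if __name__ == "__main__":
    print("defined:   ", indirect_definitions(SAMPLE))
    print("referenced:", indirect_references(SAMPLE))
```

In this fragment object 12 refers to objects 3 and 20, but only 12 and 20 are defined, so a reference can name an object that lives elsewhere in the file.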

The first 5 are scalar (single value) types:
  1. Integer - in the file as a number without a decimal point.
  2. Boolean - in the file as the text 'true' or 'false'.
  3. Real Number - in the file as a number with a decimal point.
  4. Name - in the file as '/text', i.e. a forward slash, '/', followed by some text, with no white space or delimiter characters (such as brackets and parentheses) allowed.
  5. String - in the file as either '(...characters...)' or '<...hexadecimal character codes...>' .
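
As a rough illustration of how the five scalar types can be told apart by their spelling alone, here is a small Python sketch. The function name classify_scalar is hypothetical, and the rules are simplified: it expects one already-isolated token and does not handle escape sequences or nested parentheses inside strings.

```python
import re


def classify_scalar(token: str) -> str:
    """Guess the scalar COS type of a single, already-isolated token."""
    if token in ("true", "false"):
        return "Boolean"
    if re.fullmatch(r"[+-]?\d+", token):
        return "Integer"          # digits only, no decimal point
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", token):
        return "Real"             # contains a decimal point
    if re.fullmatch(r"/[^\s()<>\[\]{}/%]*", token):
        return "Name"             # slash, then no white space or delimiters
    if len(token) >= 2 and token.startswith("(") and token.endswith(")"):
        return "String"           # literal string: (...)
    if re.fullmatch(r"<[0-9A-Fa-f\s]*>", token):
        return "String"           # hexadecimal string: <...>
    return "Unknown"


if __name__ == "__main__":
    for tok in ["42", "true", "3.14", "/Type", "(Hello)", "<48656C6C6F>", "<<"]:
        print(f"{tok!r:16} {classify_scalar(tok)}")
```

Note that '<<' is reported as Unknown here because it opens a Dictionary rather than a hexadecimal String.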
The next 3 are container types:
  1. Dictionary - in the file as '<<...other objects...>>'. Dictionary entries always come in pairs: a Name Object (the key) followed by an object of any type (the value).
  2. Array - in the file as '[...other objects...]'. A list of objects with no delimiters between them, separated by white space only where necessary.
  3. Stream - in the file as '20 0 obj<<...stream attribute objs...>>stream...binary data...endstream'. This is the most complex type. It's actually a Dictionary Object mated with a string of bytes. The Dictionary contains the information necessary for accessing the data in the string of bytes. Streams are always indirect objects, so they always begin with an object reference.
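
The pairing of a Dictionary with a string of bytes can be sketched in a few lines of Python. The example below is built on several assumptions: the stream object is hand-written, its dictionary is flat (no nested dictionaries), its /Length entry is a direct integer, and the data is not compressed. In real files /Length may itself be an indirect reference and the data is usually run through a filter, so treat this as an illustration rather than a working reader. The name read_stream is hypothetical.

```python
import re

# Hypothetical stream object; "Hello World" is exactly 11 bytes long.
OBJ = b"20 0 obj\n<< /Length 11 >>\nstream\nHello World\nendstream\nendobj\n"

# Object header, a flat dictionary, then the 'stream' keyword and end of line.
STREAM_RE = re.compile(
    rb"(\d+)\s+(\d+)\s+obj\s*<<(.*?)>>\s*stream\r?\n", re.DOTALL
)


def read_stream(data):
    """Return (object number, declared length, raw bytes) of the first stream."""
    m = STREAM_RE.search(data)
    if m is None:
        raise ValueError("no stream object found")
    # Assumption: /Length is a direct integer inside the stream dictionary.
    length_match = re.search(rb"/Length\s+(\d+)", m.group(3))
    if length_match is None:
        raise ValueError("stream dictionary has no direct /Length entry")
    length = int(length_match.group(1))
    start = m.end()  # first byte after the 'stream' line
    return int(m.group(1)), length, data[start:start + length]


if __name__ == "__main__":
    print(read_stream(OBJ))
```

The dictionary tells the reader how many bytes to take, which is why the byte data can contain anything at all, including text that looks like PDF syntax.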


