PDF In-Depth

Navigating the Internal Structure of a PDF Document

About the Author
 

Thom Parker

WindJack Solutions founder Thom Parker has been developing solutions for Adobe software since 1997. WindJack Solutions' focus is PDF development and Thom is actively involved in all things PDF. He developed two innovative tools for PDF developers and users, PDF CanOpener and...

A PDF Document is a many-layered thing; that is, it has many layers of abstraction. You can look at a PDF from different perspectives, each with its own advantages and disadvantages. At the lowest level, the PDF File contains the raw document data. Next up, the COS Layer organizes this data into a tree of simple objects. At the PD Layer, these simple objects are put together to implement useful intermediate-level structures like Fonts and Images. These are in turn organized into higher-level constructs like Annotations and Pages. Some of these objects are also used to impose logical structure, like paragraphs and article threads. And there are more layers still.

Each of these layers of abstraction has its own independent set of rules. For example, a file that is perfectly legal at the file format level may not contain any useful objects. Likewise, the COS Object Tree may contain many objects that do not contribute to the document display, or that are completely unintelligible to Acrobat, and still form a legal object tree.

Knowing how to navigate these structures is essential to any PDF-related development effort. But what does a real document look like at the COS Object level? What are these objects, and what is really necessary to make a PDF Document? In the following text the structure of a real PDF Document is laid bare.

PDF File Structure:

Don't let this next section discourage you. It's an introduction to the file format. The infinitely more understandable PDF object structure follows.

The PDF File Format is text with some binary data mixed in. If you open a PDF file in a text editor you'll see the raw objects that define the structure and content of the document. Explicit object definitions are prefixed with some text that looks like this: '12 0 obj'. The number 12 is the object number and the 0 is the generation number; together they identify the object. An object defined this way is called indirect, since it can be referenced by its number. You will also see objects without this prefix. These are called direct objects and are always contained inside other objects. A container object references an indirect object with the syntax '12 0 R', which here refers to the object defined with '12 0 obj'. There are only 8 low level, or COS, object types. (The PDF specification also defines a null object, written 'null', which is not covered here.)
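To make the '12 0 obj' and '12 0 R' syntax concrete, here is a minimal Python sketch that scans a fragment of PDF text for object definitions and references. The sample fragment and the function names are made up for illustration. A plain pattern scan like this is only a rough aid: it can report false matches inside strings or binary stream data, so it is not a substitute for a real PDF parser.

```python
import re

# A small hand-written fragment of PDF syntax (hypothetical, not from a real file).
SAMPLE = b"""12 0 obj
<< /Type /Page /Parent 3 0 R /Contents 20 0 R >>
endobj
"""

# "<object number> <generation number> obj" opens an indirect object definition.
OBJ_DEF = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")
# "<object number> <generation number> R" is a reference to an indirect object.
OBJ_REF = re.compile(rb"(\d+)\s+(\d+)\s+R\b")


def find_definitions(data):
    """Return (object number, generation number) for each 'N G obj' found."""
    return [(int(m.group(1)), int(m.group(2))) for m in OBJ_DEF.finditer(data)]


def find_references(data):
    """Return (object number, generation number) for each 'N G R' found."""
    return [(int(m.group(1)), int(m.group(2))) for m in OBJ_REF.finditer(data)]


if __name__ == "__main__":
    print("defined:", find_definitions(SAMPLE))
    print("referenced:", find_references(SAMPLE))
```

In this fragment, object 12 is defined, and it refers to objects 3 and 20, which would be defined elsewhere in the file.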

The first 5 are scalar (single value) types:
  1. Integer - in the file as a number without a decimal point.
  2. Boolean - in the file as the text 'true' or 'false'.
  3. Real Number - in the file as a number with a decimal point.
  4. Name - in the file as '/text' i.e. a forward slash, '/', followed by some text, no white space or delimiter characters (such as parentheses or brackets) allowed.
  5. String - in the file as either '(...characters...)' or '<...hexadecimal character codes...>' .
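The five scalar types above can be told apart just by how a token is written. The following Python sketch shows one way to do that; the function name classify_scalar is hypothetical, and the sketch assumes the token has already been split out of the file and ignores details such as escape sequences inside strings.

```python
import re


def classify_scalar(token):
    """Return the COS scalar type name for one token of PDF syntax."""
    if token in ("true", "false"):
        return "Boolean"
    # Digits only, with an optional sign: no decimal point.
    if re.fullmatch(r"[+-]?\d+", token):
        return "Integer"
    # A decimal point with digits on at least one side.
    if re.fullmatch(r"[+-]?(\d+\.\d*|\.\d+)", token):
        return "Real"
    if token.startswith("/"):
        return "Name"
    # Literal string in parentheses.
    if token.startswith("(") and token.endswith(")"):
        return "String"
    # Hexadecimal string in single angle brackets ('<<' would open a Dictionary).
    if token.startswith("<") and token.endswith(">") and not token.startswith("<<"):
        return "String"
    return "Unknown"


if __name__ == "__main__":
    for tok in ["42", "true", "3.14", "/Type", "(Hello)", "<48656C6C6F>"]:
        print(tok, "is a", classify_scalar(tok))
```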
The next 3 are container types:
  1. Dictionary - in the file as '<<...other objects...>>'. Dictionary entries are always in pairs, a Name Object followed by any other object type.
  2. Array - in the file as '[...other objects...]'. A list of un-delimited objects separated by white space only where necessary.
  3. Stream - in the file as '20 0 obj<<...stream attribute objs...>>stream...binary data...endstream'. This is the most complex type. It's actually a Dictionary Object mated with a string of bytes. The Dictionary contains the information necessary for accessing the data in the string of bytes. Streams are always indirect objects, so they always begin with an object definition prefix such as '20 0 obj'.
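As an illustration of how the Stream's Dictionary describes its bytes, here is a minimal Python sketch that builds a small stream object in memory and reads it back. The object contents and the function name read_stream are invented for this example. It assumes the Dictionary contains no nested dictionaries, that /Length is a direct integer, and that the only filter in use is /FlateDecode (zlib compression). Real files can break all three assumptions.

```python
import re
import zlib

# Build a hypothetical stream object: compressed data plus a describing Dictionary.
payload = zlib.compress(b"BT /F1 12 Tf (Hello) Tj ET")
stream_obj = (
    b"20 0 obj\n<< /Length " + str(len(payload)).encode("ascii")
    + b" /Filter /FlateDecode >>\nstream\n"
    + payload
    + b"\nendstream\nendobj\n"
)


def read_stream(obj_bytes):
    """Split a stream object into its Dictionary text and its decoded data."""
    # Locate the Dictionary (assumes no nested '<<...>>' inside it).
    dict_start = obj_bytes.index(b"<<")
    dict_end = obj_bytes.index(b">>", dict_start) + 2
    dictionary = obj_bytes[dict_start:dict_end]

    # /Length gives the number of bytes between 'stream' and 'endstream'.
    length = int(re.search(rb"/Length\s+(\d+)", dictionary).group(1))

    # The data starts after the 'stream' keyword and its end-of-line marker.
    data_start = obj_bytes.index(b"stream", dict_end) + len(b"stream")
    if obj_bytes[data_start:data_start + 2] == b"\r\n":
        data_start += 2
    elif obj_bytes[data_start:data_start + 1] == b"\n":
        data_start += 1
    raw = obj_bytes[data_start:data_start + length]

    # Undo the compression if the Dictionary names the Flate filter.
    if b"/FlateDecode" in dictionary:
        raw = zlib.decompress(raw)
    return dictionary, raw


if __name__ == "__main__":
    dictionary, data = read_stream(stream_obj)
    print(dictionary)
    print(data)
```

The point of the sketch is that the bytes are read using /Length from the Dictionary rather than by searching for 'endstream', since binary data could happen to contain that text.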



September 03, 2009