PDF In-Depth

XML and PDF: Of applications and philosophy

July 27, 2000


XML (eXtensible Markup Language) has certainly caught on, and more and more businesses see the value of using it. Because XML is such a flexible data format, however, its use varies widely. In general, companies produce XML for interchange (to augment or even replace EDI, Electronic Data Interchange), as a standard data format for integration with databases, or as a default file format from which documents can be constructed and published in a variety of formats. Some might see a conflict between PDF and XML, as both provide a standard way to present and exchange information. In combination with the eXtensible Stylesheet Language (XSL), which plays a role similar to cascading style sheets for HTML, XML data can be precisely displayed in a browser that supports XML and XSL.

Where do XML and PDF files meet?

With the release of PDF version 1.3 (with Acrobat 4.0), Adobe provided a means to embed XML-like data in PDF files. The structure of the document can be extracted and rendered as an XML, HTML or other structured document. One important point: the structure, called a structure tree, is not a true XML document. It must be processed and marked up further before it can truly be called an XML document. So PDF supports the embedding of structured data in a PDF document. Let's talk about:

  1. how we extract structure from a PDF document;
  2. how we add structure to a PDF document; and
  3. why we would do any of this.

1) How to extract structure from a PDF file

We'll begin with how to extract data from PDF files, as this question is far more common than how to add structure. A structure tree can be extracted from a PDF file only if one is embedded in it, and not all PDF files have a structure tree. The embedded, XML-like structure tree is called logical structure. Logical structure is independent of the PDF file's representation, that is, how images, text and other design elements appear when viewed.

We'll cover how to add the structure next, but it's important to know that there are really two ways to extract structure from a document. A document's structure is either defined using markup and embedded in the PDF file (more on this later), or implied by the document's layout, that is, the way the document flows on the page. For instance, you can determine the structure of a newspaper quite simply. There's a headline, sub-heads, a byline (or slug), the first paragraph, second paragraph, pictures (inset or otherwise), a callout, etc. These items occur and are displayed in an order within the document based on how Westerners read the page.

What methods are there to extract structure from PDF files? Let's focus on extracting content from a PDF file and forming a valid XML file. Here, as mentioned, there are two approaches and several solutions.

Implied Structure
If you created your PDF file using QuarkXPress or Adobe Illustrator, you probably have a document that does not have any structure in its data. Because the data is not explicitly defined, as it would be in an XML or HTML document, its structure must be implied. Several companies offer services and software that extract XML, HTML, or both from PDF files (some examples are Iceni, Televisual, and Texterity, all of which are featured in the tools section on PlanetPDF.com). These services and tools analyze the page's layout and build structured data from the PDF file. The result is the XML file we were looking for. However, depending on the nature of the original PDF document, and by nature I mean the way it was organized visually on the page, the result may be a bit disappointing. The structure may not be correct: some segments of the page may be in the wrong order, and elements such as page numbers and footers that should be excluded may be included. Regardless of the quality of the results, you'll get a data file you can edit by hand more quickly than you could by copying and pasting from the PDF file yourself.

Some time ago I built an application that relied heavily on Iceni's Gemini technology. The application extracted content from newspaper real estate display ads (the ones with pictures of all the houses). The system exported thousands of advertisements into an XML file that was then parsed and imported into a relational database. The process to that point was totally automated and saved hundreds of hours each day compared to other proposed solutions.
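
To make the idea of implied structure concrete, here is a minimal Python sketch of the kind of layout analysis such tools might perform. It is not how Iceni, Televisual, or Texterity actually work; the TextBlock type, the font-size thresholds, the margin size, and the element names are all assumptions made for illustration. It assumes that text blocks with coordinates have already been pulled from a page, sorts them into Western reading order, drops blocks that look like page numbers, and guesses an element name from the font size.

```python
from dataclasses import dataclass
import xml.etree.ElementTree as ET


@dataclass
class TextBlock:
    """Hypothetical text block already extracted from one PDF page."""
    text: str
    x: float          # left edge, in points
    y: float          # distance from the top of the page, in points
    font_size: float


def guess_tag(block: TextBlock) -> str:
    """Guess an element name from font size (thresholds are assumptions)."""
    if block.font_size >= 24:
        return "headline"
    if block.font_size >= 14:
        return "subhead"
    return "para"


def is_page_furniture(block: TextBlock, page_height: float) -> bool:
    """Treat purely numeric blocks in the bottom margin as page numbers."""
    return block.y > page_height - 50 and block.text.strip().isdigit()


def blocks_to_xml(blocks, page_height=792.0) -> str:
    """Order blocks top-to-bottom, then left-to-right, and emit XML."""
    root = ET.Element("article")
    kept = [b for b in blocks if not is_page_furniture(b, page_height)]
    for block in sorted(kept, key=lambda b: (round(b.y), b.x)):
        ET.SubElement(root, guess_tag(block)).text = block.text
    return ET.tostring(root, encoding="unicode")


# Sample blocks in arbitrary order, as an extractor might return them.
blocks = [
    TextBlock("Prices rose again in June.", 72, 160, 10),
    TextBlock("Housing Market Heats Up", 72, 80, 28),
    TextBlock("3", 300, 770, 9),
    TextBlock("Buyers face stiff competition", 72, 120, 16),
]
print(blocks_to_xml(blocks))
```

A simple sort like this would put a multi-column page in the wrong order, which is exactly the kind of disappointing result described above.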

Defined Structure
The second method, which applies only if the PDF file includes a structure tree, enables direct access to the data embedded within the PDF file. Within PDF 1.3, logical structure is described in one or more structure trees. This access is accomplished through the PDSEdit methods of the Acrobat API. As Adobe's API documentation describes, PDSEdit lets you navigate the structure tree, search within it, and bookmark sections or specific content. Here's where these structure trees take on XML-like features. You could, for instance, search for words that are contained within the structure tree's "title" element. (Here, I'm using "element" as you might when describing an XML file.) Clearly, there's some value to this. You could bookmark your PDF document based on its inherent structure (headings, figures, etc.) rather than manually adding bookmarks based on selected page elements. Likewise, you could scan the document's structure and extract headlines for use in a database.
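
The PDSEdit calls themselves are documented by Adobe and are not reproduced here. The Python sketch below instead uses a hypothetical in-memory StructElem class to stand in for a structure tree that has already been read from a PDF file. It illustrates two of the operations described above: searching the text of "title" elements and turning the tree into XML. Every class and function name is an assumption made for illustration.

```python
from dataclasses import dataclass, field
import xml.etree.ElementTree as ET


@dataclass
class StructElem:
    """Hypothetical stand-in for one node of a PDF structure tree."""
    kind: str                      # e.g. "document", "title", "para"
    text: str = ""
    children: list = field(default_factory=list)


def walk(elem):
    """Yield every element in the tree in document order."""
    yield elem
    for child in elem.children:
        yield from walk(child)


def find_in_titles(root, word):
    """Return the text of each "title" element that contains the word."""
    return [e.text for e in walk(root)
            if e.kind == "title" and word.lower() in e.text.lower()]


def to_xml(elem):
    """Convert the tree to an ElementTree element, recursively."""
    node = ET.Element(elem.kind)
    node.text = elem.text or None
    for child in elem.children:
        node.append(to_xml(child))
    return node


# A small hand-built tree; a real one would come from the PDF file.
tree = StructElem("document", children=[
    StructElem("section", children=[
        StructElem("title", "Extracting structure"),
        StructElem("para", "Not all PDF files have a structure tree."),
    ]),
    StructElem("section", children=[
        StructElem("title", "Adding structure"),
        StructElem("para", "Use pdfmark or the Acrobat API."),
    ]),
])

print(find_in_titles(tree, "adding"))
print(ET.tostring(to_xml(tree), encoding="unicode"))
```

The same walk could just as easily collect headings for bookmarks or for loading into a database.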

Should you want to build a custom application that extracts content from PDF files, refer to the Acrobat Core API Overview (http://partners.adobe.com/asn

2) How to add structure to a PDF file

With some understanding of how to get structured data from PDF files, you may wish to change the way in which you create your PDF files to get better results when the data is extracted. The most effective way to extract well-formed, structured data from your PDF files is to embed the data.

The PDF file then becomes a container for the structured data and the look and feel of the document (as PDF files are commonly used).

Adding structure to a PDF file is accomplished with one of two methods:

  1. applying structure through PDF markup and
  2. adding the structure programmatically, using the Acrobat API.

Remember that the logical structure of a document is independent of, yet related to, the page structure, and that the structured data does not affect how images, text boxes and other page elements appear in a PDF page.

1) To add structured data to your PDF when the PDF is created, use pdfmark. Pdfmarks are PostScript extensions that generate a variety of advanced Acrobat objects including links, bookmarks, annotations, logical structure, etc. The pdfmarks are added to the PostScript language code when the authoring application prints the document. When the Acrobat Distiller application creates the PDF file from the PostScript code, it generates the structure information added via the pdfmarks. Adding the structure requires an authoring application that permits the creation of pdfmarks. More information about pdfmarks can be found in the pdfmark reference manual (http://partners.adobe.com/asn
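
To give a feel for what pdfmark code looks like, the Python sketch below generates the bookmark form of pdfmark, one of the object types listed above, from a list of headings. It deliberately does not attempt the logical-structure operators, whose exact syntax should be taken from the pdfmark reference manual. The headings list and the function names are assumptions made for illustration; the generated lines would be added to the PostScript code before Distiller processes it.

```python
def ps_string(text: str) -> str:
    """Escape backslashes and parentheses for a PostScript string literal."""
    escaped = (text.replace("\\", "\\\\")
                   .replace("(", "\\(")
                   .replace(")", "\\)"))
    return f"({escaped})"


def bookmark_pdfmarks(headings):
    """Build one bookmark pdfmark line per (title, page number) pair."""
    return [f"[ /Title {ps_string(title)} /Page {page} /OUT pdfmark"
            for title, page in headings]


# Hypothetical headings, as an authoring tool might collect them.
headings = [
    ("How to extract structure (overview)", 1),
    ("How to add structure", 3),
]

for line in bookmark_pdfmarks(headings):
    print(line)
```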

Note that when you use Acrobat 4.0's Web Page capture, the logical data structure of the HTML document that is converted to a PDF file can be preserved. To preserve the structure of captured Web pages:

  1. Select File > Open Web Page…
  2. In the Open Web Page Dialog box, click "Conversion Settings…".
  3. In the Conversion Settings dialog box, select the Add PDF Structure checkbox.
  4. Open the Web page; its HTML structure will be added to the PDF file.

2) As an alternative, the Acrobat API provides a means to add structured data as one or more structure trees. So, if you have a PDF file that you want to add structured data to (it need not correspond to the layout of the document), you can do so programmatically. Further discussion of the API and the various calls used to create structure is quite technical; refer to the Acrobat Core API Overview (http://partners.adobe.com/asn

3) Why would we want to store structured data within PDF files?

Judging from the numerous emails I received on this topic, the interest in storing logical structure within PDF files is primarily for repurposing the content. Publishers spend a large percentage of their time and money finding ways to convert one document's file type into another, and more time and money actually doing the file conversion. Clearly, one benefit of maintaining the logical structure of documents within PDF files is the combination of look and feel with structured data. If designers using QuarkXPress, InDesign and other desktop publishing or traditional design tools could publish electronically to a format that maintained both the integrity of their design (as PDF does) and the data structure of those elements (as XML does), then we may have found the perfect file format.

Well, that sounds good, but there is more to it. Remember that XML, together with the XSL presentation style sheet, promises to offer presentation capabilities similar to those of PDF, though we're not quite there yet. The advantage of XML (with XSL) is the ability, from a publisher's perspective, to separate the data (or content) from the presentation (or look and feel). PDF does a very good job of preserving the interrelationship of these two elements, and this control over look and feel, as well as portability, are two of many reasons why PDF has proven to be an excellent file format for publishers. Now if we could represent the data of our documents in a PDF file's structure tree and render its look and feel using a PDF file's unique page description language, then we'd have something (and it would look a lot like what's been proposed with the XSL standard).

Final Thoughts

As you consider how to maintain the integrity of data for future use, the best format is a structured format and there is no better format than XML. Alternatively, you may have requirements to maintain the branding image or look and feel of your documents; here PDF is an excellent choice. How can you get the maximum value from what both XML and PDF offer? One solution could be to embed XML-like structured data in all your PDF files, and PDF 1.3 can support this. An alternative, and a method I've advocated since my book Internet Publishing with Acrobat (Adobe Press, 1996), is a concurrent publishing model, where your primary data is stored as XML, and this data is then used within your design application. Be careful, though, as many designers want to modify the data you provide for better fit, or they simply take artistic license with the data (or copy). You'll need to assess how applicable a concurrent publishing model is and how well you can enforce it. To download the chapter from my book on the concurrent publishing model, visit http://www.gordonkent.com

Combining PDF files and XML is technically possible; however, applications that make use of this technology have yet to appear with any frequency. Certainly e-businesses see the value of XML. As they begin considering the impact of structured data when assessing their marketing materials (and other documents often published as PDF files), the popularity of embedding logical structure within PDF files may increase. In the short term, the popularity of PDF files is not likely to decrease as XML gains acceptance. Instead, publishers who wish to repurpose their PDF files as structured file formats, such as HTML or XML, will make use of the available conversion or extraction tools that produce structured content. And there the relationship between XML and PDF may remain, much as the relationship between PDF and HTML is today.
