PDF In-Depth

What are you putting in your PDF files?

April 19, 2011

Mark is the CEO of IDRsolutions, the company behind JPedal, a 100% Java PDF library. He blogs at Java PDF Blog.

PDF files are generally judged on how they appear. This is a shame because it is possible to create a well-crafted PDF (with lots of practical uses) or a horrible PDF (which is pretty useless) and the two versions will superficially look identical onscreen.

This article will explain what the difference is and how you can tell. I get to see an awful lot of PDF files in my day job developing a Java PDF viewer, so I would like to tell you about the good, the bad and the ugly.

Because PDF files are designed to be read and printed (jobs they do very well), most people judge them on their superficial appearance. The format actually lets you put just about anything inside a PDF -- images, text, vector graphics -- and what you put in can alter the flexibility of the PDF and what you can use it for.

Some PDF files contain just images. Even if they look like they contain text and shapes, these are just bit-mapped images inside a PDF. They look okay, but you cannot search them (there is no text in them) and you need an OCR tool for text extraction. They also tend to be large and they do not scale well. Trying to zoom into them results in a pixelated display. They also tend to need lots of memory. You can spot these files very easily by zooming into them or by trying to select the text (Ctrl-A). If you cannot select any of the text you can see, chances are that it's an image.

Less common, but still found, are PDF files where the text has been converted to shapes. Again, these files tend to be on the large side, and you cannot search them for text. You need OCR to get text out of them.
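The manual checks described above (zooming in, trying to select text) can be roughly automated. What follows is a minimal Python sketch under stated assumptions, not how any particular viewer does it: it assumes that a PDF containing real text declares at least one font, and that bitmaps appear as image XObjects. The names `searchable_blobs` and `classify_pdf` are made up for this illustration. The sketch only understands Flate-compressed streams, it does not handle encrypted files, and a scan that carries an invisible OCR text layer will be reported as containing text.

```python
import re
import zlib

# Matches the body of every "stream ... endstream" section in the raw file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
FONT_RE = re.compile(rb"/Type\s*/Font\b")
IMAGE_RE = re.compile(rb"/Subtype\s*/Image\b")


def searchable_blobs(pdf_bytes):
    """Yield the raw file, then every stream that can be inflated.

    Newer PDF files may pack their dictionaries into compressed object
    streams, so looking at the raw bytes alone is not enough.
    """
    yield pdf_bytes
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # decompressobj tolerates a clipped or padded end of stream.
            yield zlib.decompressobj().decompress(match.group(1))
        except zlib.error:
            continue  # not Flate data (for example a JPEG image); skip it


def classify_pdf(pdf_bytes):
    """Heuristic guess: 'text', 'image-only' or 'no-text-no-images'."""
    has_font = False
    has_image = False
    for blob in searchable_blobs(pdf_bytes):
        has_font = has_font or FONT_RE.search(blob) is not None
        has_image = has_image or IMAGE_RE.search(blob) is not None
    if has_font:
        return "text"
    if has_image:
        return "image-only"
    # No fonts and no images: possibly text that was converted to shapes.
    return "no-text-no-images"
```

To try it on a file of your own (the file name here is only a placeholder), call `classify_pdf(open("report.pdf", "rb").read())`. Treat the answer as a hint rather than a verdict; a proper PDF library that parses the page resources will be far more reliable.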

The biggest complaint we see against the PDF file format is about text extraction. People complain that it is very hard to get formatted text out of a PDF file for editing. This is because PDF was originally designed as a final-form display format (unlike Word documents) and did not include any document structure.

Since PDF version 1.4, though, it has been perfectly possible to include tags in the PDF file. This allows the extraction of formatted content -- but only if the PDF was created with the tags included. Most PDF creation tools still leave them out. I wrote an article on my blog explaining how you can see if the tags are present.
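As a rough illustration of what "tags included" means at the file level, here is a small Python sketch. It rests on two assumptions: that a tagged PDF exposes a `/StructTreeRoot` entry and a `/MarkInfo` dictionary with `/Marked true`, and that scanning the raw bytes plus any Flate-compressed streams is enough to find them. The function name `looks_tagged` is invented for this example. It does not handle encrypted files, and it says nothing about whether the tags are any good, only whether they appear to be present.

```python
import re
import zlib

# Matches the body of every "stream ... endstream" section in the raw file.
STREAM_RE = re.compile(rb"stream\r?\n(.*?)\r?\n?endstream", re.DOTALL)
STRUCT_RE = re.compile(rb"/StructTreeRoot\b")
MARKED_RE = re.compile(rb"/Marked\s+true\b")


def looks_tagged(pdf_bytes):
    """Return True if the file appears to carry a structure tree and the
    /Marked true flag. This is a heuristic, not a validation."""
    blobs = [pdf_bytes]
    for match in STREAM_RE.finditer(pdf_bytes):
        try:
            # The document catalog may sit inside a compressed object stream.
            blobs.append(zlib.decompressobj().decompress(match.group(1)))
        except zlib.error:
            continue  # not Flate data; nothing to search in it
    has_struct = any(STRUCT_RE.search(blob) for blob in blobs)
    has_marked = any(MARKED_RE.search(blob) for blob in blobs)
    return has_struct and has_marked
```

A call such as `looks_tagged(open("report.pdf", "rb").read())` (placeholder file name) gives a quick yes or no. Requiring both markers is a deliberate design choice in this sketch: a file can contain a structure tree without declaring itself as tagged, and this check counts such a file as untagged.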

I often find people complain that the PDF file format is defective, when actually the features are there -- they just have not been used properly. If we were all more "choosy" about how our PDF files were made, a lot of issues would go away. So next time you work with PDF files, remember that not all PDF files are created equal and look beyond the immediate appearance to see how well-made they are. It will have a big impact on what use can be made of the files.
