PDF In-Depth

OCR, PDFs, and Bates-numbered documents

June 01, 2006


Optical Character Recognition (OCR) is a great tool. As most of you know, a scanned file is basically just an image. Even though the image may be a document that contains words, the computer regards those words as nothing more than pixels to display. A word-processing file, by contrast, is an assemblage of characters that the computer can recognize as such, which is why you can word-search a text-based document but not a scanned image. Unless, that is, you OCR the image file.

When you tell the computer to do OCR, you are asking it to do something very sophisticated. The computer has to analyze each assemblage of pixels to determine what character that assemblage might be. The cleaner the pixels, the better the chance that the computer will guess the right character.
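To make the "guessing" concrete, here is a toy sketch in Python. It is not how Acrobat or any real OCR engine works; the tiny 3x5 letter templates and the function names are invented for illustration. It simply picks the letter whose template agrees with the most pixels, which is enough to show how a dirty scan can lead to a wrong guess.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# The 3x5 glyph templates below are made up for this sketch.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def match_score(glyph, template):
    """Return the fraction of pixels on which the glyph and template agree."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def recognize(glyph):
    """Guess the character whose template agrees with the most pixels."""
    scores = {ch: match_score(glyph, tpl) for ch, tpl in TEMPLATES.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


# A cleanly scanned "T".
clean_t = ["###", ".#.", ".#.", ".#.", ".#."]
# The same "T" after a poor scan: a dark smudge fills the bottom row.
noisy_t = ["###", ".#.", ".#.", ".#.", "###"]

print(recognize(clean_t))  # the clean glyph matches the "T" template exactly
print(recognize(noisy_t))  # the smudged glyph now looks exactly like an "I"
```

The smudged "T" is read as an "I" because, pixel for pixel, that is what it has become. That is the sense in which cleaner scans give the computer a better chance.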

Adobe Acrobat has long had an OCR function, but in prior versions it was called "Paper Capture." Acrobat 6.0 was the first version that, to my mind, handled OCR reliably. Acrobat 7.0 does an even better job, although it introduces quirks in other areas that I'm not crazy about. In any event, the OCR/Paper Capture function is a great tool in Acrobat because it keeps the image file intact but identifies the characters in the image so that you can search across a document set for key words. Obviously, this is a nice tool for litigators who deal with document productions. So even though it takes a fair amount of time to OCR a document (approximately 15 seconds per page, more or less), it's often worthwhile. Which brings me to reader mail.

Today I got a great question from a reader about a problem he had when he ran the OCR function in Acrobat 6.0:

One question regarding the OCR function -- have you come across the problem where part of a scanned page had "renderable text" but the remainder does not? Apparently Acrobat 6.0 decides that it cannot OCR the remainder of the page, a dialog box appears acknowledging the problem, and you either cancel the OCR or move on to the next page.

This seems to have happened to me in one production where the Bates ranges are text, but the rest of the page is scanned. I'd assume this is because the documents were scanned and then some program like Easy Bates or something similar applied a Bates range to the PDF.

Any thoughts?

I have indeed had problems OCRing documents after I had done something to them (like Bates-stamping them electronically). In other words, OCR works best if done right after you've scanned the documents. And of course, as I said before, it works best on clean copies (fax copies, for example, are usually not going to give you good results). So, if you intend to OCR your PDF documents, it's best to do that first and then apply the Bates stamp. Of course, if any of you readers out there have other observations, please share them in the comments section. Thanks.
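For readers who script their document handling, the "OCR first, then stamp" rule can be sketched roughly as follows. This is only an illustration: the `Page` record, the `classify` and `bates_label` helpers, and the `ABC` prefix are all hypothetical and do not correspond to any Acrobat feature or real PDF library.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical summary of one PDF page; not a real Acrobat or library type."""
    number: int
    has_scanned_image: bool
    has_text: bool


def classify(page):
    """Label a page by what it contains."""
    if page.has_scanned_image and page.has_text:
        # For example, a scanned page that was Bates-stamped with text before OCR.
        return "mixed"
    if page.has_scanned_image:
        return "image-only"
    if page.has_text:
        return "text-only"
    return "empty"


def bates_label(prefix, number, width=6):
    """Format a Bates number such as ABC000123 (prefix and width are assumptions)."""
    return f"{prefix}{number:0{width}d}"


pages = [
    Page(1, has_scanned_image=True, has_text=False),  # fresh scan
    Page(2, has_scanned_image=True, has_text=True),   # stamped before OCR
]

for page in pages:
    kind = classify(page)
    if kind == "image-only":
        label = bates_label("ABC", page.number)
        print(f"page {page.number}: OCR first, then stamp {label}")
    elif kind == "mixed":
        print(f"page {page.number}: already contains text; OCR may balk")
    else:
        print(f"page {page.number}: {kind}, nothing to OCR")
```

The point of the sketch is the ordering: a page that is still image-only gets OCR before any stamp is applied, while a page that already mixes a scan with text is flagged as the kind that caused the reader's problem.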

This piece originally appeared on PDFforLawyers.com, and has been reproduced with permission.

