PDF In-Depth

OCR, PDFs, and Bates-numbered documents

June 01, 2006


Optical Character Recognition (or 'OCR') is a great tool. As most of you know, a scanned file is basically just an image. Even though the image may be a document that contains words, the computer regards those words as nothing more than pixels to display. A word-processing file, by contrast, is an assemblage of characters that the computer can recognize as such, which is why you can word search a text-based document but not a scanned image. Unless, that is, you OCR the image file.

When you tell the computer to do OCR, you are asking it to do something very sophisticated. The computer has to analyze each assemblage of pixels to determine what character that assemblage might be. The cleaner the pixels, the better the chance that the computer will guess the right character.
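For the technically curious, here is a toy sketch of that guessing step. It is not how Acrobat or any real OCR engine works; it simply compares a tiny grid of pixels against three made-up letter templates and picks the closest one. The templates and the function names are assumptions invented for this illustration.

```python
# Toy illustration only: real OCR engines are far more sophisticated.
# Each glyph is a 3x5 grid of pixels written as strings ('#' = ink, '.' = blank).
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def pixel_mismatches(glyph, template):
    """Count the pixels where the scanned glyph differs from a template."""
    return sum(
        g != t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )


def guess_character(glyph):
    """Return (best_guess, mismatches) for the closest template."""
    best = min(TEMPLATES, key=lambda ch: pixel_mismatches(glyph, TEMPLATES[ch]))
    return best, pixel_mismatches(glyph, TEMPLATES[best])


clean_t = ["###", ".#.", ".#.", ".#.", ".#."]
noisy_t = ["###", ".#.", ".#.", ".#.", "###"]  # two stray specks on the bottom row

print(guess_character(clean_t))  # the clean scan matches the "T" template exactly
print(guess_character(noisy_t))  # the specks make this "T" look like an "I"
```

Two stray pixels are enough to turn the "T" into an "I" here, which is the same reason a dirty scan produces a less accurate text layer.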

Adobe Acrobat has long had an OCR function, although prior versions called it "Paper Capture." Acrobat 6.0 was the first version that, to my mind, handled OCR reliably. Acrobat 7.0 does an even better job, although it introduces quirks in other areas that I'm not crazy about. In any event, the OCR/Paper Capture function is a great tool because it keeps the image file intact while identifying the characters in the image, so you can search across a document set for key words. Obviously, this is a nice tool for litigators who deal with document productions. So even though it takes a fair amount of time to OCR a document (roughly 15 seconds per page), it's often worthwhile. Which brings me to reader mail.

Today I got a great question from a reader about a problem he had when he ran the OCR function in Acrobat 6.0:

One question regarding the OCR function -- have you come across the problem where part of a scanned page had "renderable text" but the remainder does not? Apparently Acrobat 6.0 decides that it cannot OCR the remainder of the page, a dialog box appears acknowledging the problem, and you either cancel the OCR or move on to the next page.

This seems to have happened to me in one production where the Bates ranges are text, but the rest of the page is scanned. I'd assume this is because the documents were scanned and then some program like Easy Bates or something similar applied a Bates range to the PDF.

Any thoughts?

I have indeed had problems OCRing documents after doing something else to them first (like Bates-stamping them electronically). In other words, OCR works best if done right after you've scanned the documents. And of course, as I said before, it works best on clean copies (fax copies, for example, usually won't give you good results). So if you intend to OCR your PDF documents, it's best to do that first and then apply the Bates stamp. Of course, if any of you readers out there have other observations, please share them in the comments section. Thanks.
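For anyone who scripts their document handling, the ordering advice above can be sketched as a small simulation. Nothing here touches a real PDF or calls Acrobat: the Page class, the ocr and bates_stamp functions, and the sample Bates numbers are all hypothetical stand-ins, and the refusal to OCR a page that already holds text simply mimics the behavior the reader described.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for one scanned page; not a real PDF object."""
    has_renderable_text: bool = False  # a fresh scan is image-only
    searchable: bool = False
    bates_number: str = ""


def ocr(page):
    """Mimic the reported behavior: refuse pages that already contain text."""
    if page.has_renderable_text:
        raise ValueError("page already contains renderable text; OCR skipped")
    page.searchable = True
    page.has_renderable_text = True  # OCR adds a text layer of its own


def bates_stamp(page, number):
    """Stamp a Bates number as text, which works on any page."""
    page.bates_number = number
    page.has_renderable_text = True


# Recommended order: OCR first, then stamp.
good = Page()
ocr(good)
bates_stamp(good, "ABC000001")

# Problem order: stamp first, then try to OCR.
bad = Page()
bates_stamp(bad, "ABC000002")
try:
    ocr(bad)
except ValueError as err:
    print("Could not OCR stamped page:", err)

print(good.searchable, bad.searchable)
```

In this model the first page ends up both searchable and stamped, while the page that was stamped first never gets a text layer for its body.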

This piece originally appeared on PDFforLawyers.com, and has been reproduced with permission.
