PDF In-Depth

OCR, PDFs, and Bates-numbered documents

June 01, 2006


Optical Character Recognition (or 'OCR') is a great tool. As most of you know, a scanned file is basically just an image. Even though the image may be a document that contains words, the computer regards those words as nothing more than pixels to display. A word-processing file, by contrast, is an assemblage of characters that the computer can recognize as such, which is why you can word-search a text-based document but not a scanned image. Unless, that is, you OCR the image file.

When you tell the computer to do OCR, you are asking it to do something very sophisticated. The computer has to analyze each assemblage of pixels to determine which character that assemblage most likely represents. The cleaner the pixels, the better the chance that the computer guesses right when it decides what character it is looking at.
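To make the idea of "guessing a character from pixels" concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions: the tiny 3x5 bitmaps, the TEMPLATES table, and the score and recognize functions are all made up for this example, and real OCR engines (including the one in Acrobat) use far more sophisticated methods. It does show why a clean scan helps: stray or missing pixels lower the match score.

```python
# Toy illustration only. Each glyph is a 3x5 bitmap:
# "#" is a dark pixel, "." is a light pixel.
# TEMPLATES, score, and recognize are hypothetical names for this sketch.
TEMPLATES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def score(glyph, template):
    """Return the fraction of pixels that agree between glyph and template."""
    total = sum(len(row) for row in template)
    same = sum(
        g == t
        for glyph_row, template_row in zip(glyph, template)
        for g, t in zip(glyph_row, template_row)
    )
    return same / total


def recognize(glyph):
    """Return (best_matching_character, match_score) for one glyph bitmap."""
    best = max(TEMPLATES, key=lambda ch: score(glyph, TEMPLATES[ch]))
    return best, score(glyph, TEMPLATES[best])


# A clean scan of "T" and a degraded one (think of a fax copy):
# one pixel dropped in the top bar and one dropped in the stem.
clean_T = ["###", ".#.", ".#.", ".#.", ".#."]
noisy_T = ["#.#", ".#.", ".#.", "...", ".#."]

print(recognize(clean_T))  # perfect match against the "T" template
print(recognize(noisy_T))  # still "T", but with a lower match score
```

With only two damaged pixels the guess survives, but the margin over the next-best letter shrinks. Damage a few more pixels and the toy recognizer starts picking the wrong letter, which is the same thing that happens, on a larger scale, with poor scans.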

Adobe Acrobat has long had an OCR function, but in prior versions it was called "paper capture." Acrobat 6.0 was the first version that, to my mind, handled OCR reliably. Acrobat 7.0 does an even better job, although it introduces quirks in other areas that I'm not crazy about. In any event, the OCR/Paper Capture function is a great tool in Acrobat because it keeps the image file intact while identifying the characters in the image, so that you can search across a document set for key words. Obviously, this is a nice tool for litigators who deal with document productions. So even though it takes a fair amount of time to OCR a document (roughly 15 seconds per page), it's often worthwhile. Which brings me to reader mail.

Today I got a great question from a reader about a problem he had when he ran the OCR function in Acrobat 6.0:

One question regarding the OCR function -- have you come across the problem where part of a scanned page had "renderable text" but the remainder does not? Apparently Acrobat 6.0 decides that it cannot OCR the remainder of the page, a dialog box appears acknowledging the problem, and you either cancel the OCR or move on to the next page.

This seems to have happened to me in one production where the Bates ranges are text, but the rest of the page is scanned. I'd assume this is because the documents were scanned and then some program like Easy Bates or something similar applied a Bates range to the PDF.

Any thoughts?

I have indeed had problems OCRing documents after doing something else to them first (such as Bates-stamping them electronically). In other words, OCR works best if you run it right after scanning. And of course, as I said before, it works best on clean copies (i.e., fax copies usually won't give you good results). So, if you intend to OCR your PDF documents, it's best to do that first and then apply the Bates stamp. Of course, if any of you readers have other observations, please share them in the comments section. Thanks.
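For readers who script their document workflows, the ordering advice can be sketched in a few lines of Python. This is a simulation, not Acrobat's API: the Page class and the ocr, stamp, bates_label, and ocr_then_stamp functions are hypothetical stand-ins, and the rule that OCR skips a page that already contains text simply models the behavior the reader described, not anything documented by Adobe.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """Hypothetical stand-in for one scanned page in a PDF."""
    has_text: bool = False    # True once any real text is on the page
    searchable: bool = False  # True once OCR has run successfully
    bates: str = ""           # Bates label, empty until stamped


def bates_label(prefix: str, number: int, width: int = 6) -> str:
    """Build a zero-padded Bates label such as 'ABC000001'."""
    return f"{prefix}{number:0{width}d}"


def ocr(page: Page) -> bool:
    """Simulated OCR. Assumption: it skips pages that already contain
    text, mirroring the 'renderable text' problem in the reader's email."""
    if page.has_text:
        return False
    page.searchable = True
    page.has_text = True
    return True


def stamp(page: Page, label: str) -> None:
    """Simulated electronic Bates stamp: adds real text to the page."""
    page.bates = label
    page.has_text = True


def ocr_then_stamp(pages, prefix, start=1):
    """Recommended order: OCR each page first, then apply the stamp."""
    for number, page in enumerate(pages, start):
        ocr(page)
        stamp(page, bates_label(prefix, number))
    return pages


# Recommended order: every page ends up both searchable and stamped.
good = ocr_then_stamp([Page(), Page()], "ABC")
print([(p.bates, p.searchable) for p in good])

# Reversed order: the stamp adds text, so the simulated OCR refuses.
bad = Page()
stamp(bad, bates_label("ABC", 3))
print(ocr(bad), bad.searchable)
```

The point of the sketch is only the ordering: once a stamp has put text on the page, the simulated OCR step declines to run, so doing OCR first sidesteps the problem entirely.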

This piece originally appeared on PDFforLawyers.com, and has been reproduced with permission.
