PDF In-Depth

OCR: Yuor Sarech Enigne Cna't Raed Tihs

July 27, 2005


The limitations of relying on text searching become clear when you use a search engine on OCR'd documents. OCR software has gotten much, much better, but you can still count on 20+ errors on most non-laser-printed pages. I never count on text searching to locate the smoking gun...

There's a little silliness out there on the web illustrating how you can still make sense of words if the first and last letters are intact, even if all the others are scrambled.

That only works for humans. Computers can only search the actual strings of letters. Some text-search products claim to have "fuzzy" searching capabilities. The only one I've seen that came anywhere close to working was Excalibur (now Convera) RetrievalWare, and it doesn't look like they are still marketing that aspect of it. It's expensive and takes significant tech expertise -- the system I used was set up by Aspen Systems, a huge lit support contractor. Others may have had better luck than I did with the "fuzzy" features of other search tools. Firms used to pay to have text files retyped in an effort to clean up "dirty OCR." Surely everyone has better things to do with their lives... In addition, it's not that easy to get to the "text layer" of a PDF file to alter the underlying text.
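To give a rough idea of what "fuzzy" searching means, here is a minimal sketch using Python's standard difflib module. It is not how RetrievalWare or any other commercial product works; the function name, the sample sentence, and the 0.75 threshold are all made up for illustration. It simply scores each word in the text against the query and keeps the ones that are close enough.

```python
from difflib import SequenceMatcher

def fuzzy_find(query, text, threshold=0.75):
    """Return the words in `text` that are similar enough to `query`.

    Hypothetical helper for illustration only; the threshold is an
    assumption and would need tuning on real OCR output.
    """
    hits = []
    for word in text.split():
        # Drop surrounding punctuation and ignore case before comparing.
        cleaned = word.strip(".,;:!?\"'()").lower()
        score = SequenceMatcher(None, query.lower(), cleaned).ratio()
        if score >= threshold:
            hits.append(cleaned)
    return hits

# Made-up OCR output in which "m" was misread as "rn".
ocr_text = "The srnoking gun was in the file."

print(fuzzy_find("smoking", ocr_text))
```

The trade-off is the usual one: lower the threshold and you catch more garbled words but also more false hits; raise it and the badly mangled words slip through again.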

Remember, the image layer is a picture of the letters; the text layer is the letters themselves. To a computer there is a huge difference, and on an OCR'd PDF they are not the same. When you use Find or Search, you are looking for strings of letters. If those letters are garbled, you won't find the words you are looking for.
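A small sketch of the difference, in Python. The two strings below are invented: one stands for what a person reads in the page image, the other for what an OCR engine might have stored in the text layer (with "m" misread as "rn" and "l" as "1", two plausible but assumed errors). The search function is a plain exact-match lookup, which is roughly what Find does.

```python
# What a person reads in the page image (hypothetical sentence).
image_reads = "Please destroy the memo before the trial."

# What the OCR engine stored in the text layer
# (assumed errors: "m" -> "rn", "l" -> "1").
text_layer = "Please destroy the rnemo before the tria1."

def exact_search(query, layer):
    """Return True if `query` appears letter-for-letter in `layer`."""
    return query.lower() in layer.lower()

for word in ("destroy", "memo", "trial"):
    print(word, "found in text layer:", exact_search(word, text_layer))
```

Only the word that survived OCR intact can be found. The other two are plainly visible on the page, yet a search for them comes back empty.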

I've long been a believer that your doc management system should allow you both to search for a document and to navigate to it. Text searching works very well on electronically generated PDFs (like word processing or email files). Otherwise, I use it as a "rough cut" tool, to make a big pile of documents into a number of smaller piles. PDF (and specifically some tools built into Acrobat) helps a lot.

That way, once you've found a good document, you know where to find it again -- don't keep running those queries every time.

This piece originally appeared on PDFforLawyers.com, and has been reproduced with permission.
