PDF In-Depth

OCR: Yuor Sarech Enigne Cna't Raed Tihs

July 27, 2005


The limitations of relying on text searching become clear when you run a search engine against OCR'd documents. OCR software has gotten much, much better, but you can still count on 20 or more errors on most pages that were not laser-printed. I never count on text searching to locate the smoking gun...

There's a little silliness out there on the web illustrating how you can still make sense of words if the first and last letters are intact, even if all the others are scrambled.

That only works for humans. Computers can only search the actual strings of letters. Some text-search products claim to have "fuzzy" searching capabilities. The only one I've seen that came anywhere close to working was Excalibur (now Convera) RetrievalWare, and it doesn't look like they are still marketing that aspect of it. It's expensive and takes significant technical expertise -- the installation I used was set up by Aspen Systems, a huge litigation support contractor. Others may have had better luck than I did with the "fuzzy" features of other search tools. Firms used to pay to have text files retyped in an effort to clean up "dirty OCR." Surely everyone has better things to do with their lives... In addition, it's not that easy to get to the "text layer" of a PDF file to alter the underlying text.
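To make the difference between exact and "fuzzy" searching concrete, here is a minimal sketch using Python's standard `difflib` module. It is only an illustration of the general idea, not how RetrievalWare or any other commercial product works. The function name `fuzzy_find`, the sample sentence, and the 0.8 similarity threshold are all assumptions made up for this example.

```python
from difflib import SequenceMatcher


def fuzzy_find(term, text, threshold=0.8):
    """Return (word, score) pairs for words in text that resemble term.

    The threshold is an arbitrary cut-off chosen for this sketch;
    lower values catch more garbled words but also more false hits.
    """
    hits = []
    for raw in text.split():
        # Strip surrounding punctuation so "agreement." compares as "agreement".
        word = raw.strip(".,;:!?\"'()").lower()
        if not word:
            continue
        score = SequenceMatcher(None, term.lower(), word).ratio()
        if score >= threshold:
            hits.append((word, round(score, 2)))
    return hits


# Hypothetical OCR output: "agreement" came through as "agreernent"
# (the classic "m" read as "rn" mistake).
ocr_text = "The parties signed the agreernent on the 5th of May."

print("agreement" in ocr_text.lower())    # exact search: is the string there?
print(fuzzy_find("agreement", ocr_text))  # fuzzy search: which words come close?
```

The exact search comes back empty-handed because the literal string is not in the text, while the similarity comparison still flags the garbled word. The catch is the threshold: set it loose enough to survive 20 errors a page and it will also start matching words you never wanted.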

Remember, the image layer is a picture of the letters; the text layer is the letters themselves. To a computer there is a huge difference, and on an OCR'd PDF they are not the same. When you use Find or Search, you are looking for strings of letters. If those letters are garbled, you won't find the words you are looking for.

I've long been a believer that your doc management system should allow you both to search for a document and to navigate to it. Text searching works very well on electronically generated PDFs (like word processing or email files). Otherwise, I use it as a "rough cut" tool, to make a big pile of documents into a number of smaller piles. PDF (and specifically some tools built into Acrobat) helps a lot.

That way, once you've found a good document, you know where to find it again -- don't keep running those queries every time.
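The "rough cut" idea mentioned above, using text search to turn one big pile into several smaller ones, can be sketched in a few lines of Python. This is a hedged illustration only: the function `rough_cut`, the file names, the pile names, and the search terms are all hypothetical, and the text is assumed to have already been extracted from each document.

```python
def rough_cut(documents, piles):
    """Sort documents into named piles by simple keyword hits.

    documents: dict of document name -> extracted text
    piles: dict of pile name -> list of search terms
    Anything that matches no pile lands in "unsorted" for a human to read.
    (This sketch assumes no pile is itself named "unsorted".)
    """
    sorted_docs = {name: [] for name in piles}
    sorted_docs["unsorted"] = []
    for doc_name, text in documents.items():
        lowered = text.lower()
        matched = False
        for pile_name, terms in piles.items():
            # Plain substring search: exactly the kind that dirty OCR defeats.
            if any(term.lower() in lowered for term in terms):
                sorted_docs[pile_name].append(doc_name)
                matched = True
        if not matched:
            sorted_docs["unsorted"].append(doc_name)
    return sorted_docs


# Hypothetical sample data; the third entry mimics dirty OCR.
documents = {
    "memo-001.pdf": "Draft lease agreement for the warehouse",
    "email-014.pdf": "Invoice attached, payment due in 30 days",
    "scan-233.pdf": "lnvoice frorn the contractor",
}
piles = {
    "contracts": ["agreement", "lease"],
    "billing": ["invoice", "payment"],
}

result = rough_cut(documents, piles)
for pile_name, names in result.items():
    print(pile_name, names)
```

Notice where the scanned invoice ends up: its garbled text matches nothing, so it falls into the unsorted pile. That is the reason to treat search as a first pass and still have someone look at what is left over.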

This piece originally appeared on PDFforLawyers.com, and has been reproduced with permission.

