Google to index scanned PDF documents

November 06, 2008


Google has implemented technology that will allow it to index the full text of scanned PDF documents. In the past, such documents were rarely indexed at all.

While the search giant has provided full-text indexing of PDF documents for some time, scanned documents posed special problems. PDFs come in various flavors, including text-only, image-plus-text and image-only. The first two typically result when a PDF is generated directly from an electronic source such as a Word document. Because they already include text, they are relatively easy to index.

By contrast, image-only PDFs are typically created by scanning paper documents. Computers cannot read text directly from such files: while the resulting PDFs look like the printed originals, they are in fact flat images without any textual content.
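The distinction between the three flavors can be sketched in a few lines of Python. This is only an illustration: the `PdfPage` summary, its two flags and the `classify` and `needs_ocr` helpers are hypothetical names, not part of any real PDF library, and a real indexer would have to derive this information by parsing each page.

```python
from dataclasses import dataclass


@dataclass
class PdfPage:
    # Hypothetical per-page summary; a real tool would compute these flags
    # by inspecting the page's content.
    has_text_layer: bool  # page carries machine-readable text
    has_image: bool       # page carries at least one raster image


def classify(pages):
    """Return the flavor of a document given its page summaries."""
    has_text = any(page.has_text_layer for page in pages)
    has_image = any(page.has_image for page in pages)
    if has_text and has_image:
        return "image plus text"
    if has_text:
        return "text only"
    if has_image:
        return "image only"
    return "empty"


def needs_ocr(pages):
    """Only image-only documents lack text that can be indexed directly."""
    return classify(pages) == "image only"


scanned = [PdfPage(has_text_layer=False, has_image=True)]
exported = [PdfPage(has_text_layer=True, has_image=False)]
print(classify(scanned), needs_ocr(scanned))
print(classify(exported), needs_ocr(exported))
```

Under these assumptions, only the scanned document is flagged as needing further processing before its text can be indexed.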

As Evin Levey, Google Product Manager, put it in the original blog post:

To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter "O", just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.

To index the document's text, Optical Character Recognition (OCR) needs to be performed. OCR is the process of comparing the shapes in the scanned image with known character patterns to determine which shapes represent text and which characters they are. Once complete, this allows the document's text to be properly indexed.
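The comparison step can be illustrated with a toy template matcher in Python. This is a minimal sketch of the general idea, not Google's method: the 3x5 bitmaps in `TEMPLATES` and the `similarity`, `recognize` and `recognize_line` helpers are made up for illustration, and production OCR engines rely on far more sophisticated models.

```python
# Toy template-matching OCR: compare a glyph bitmap against a small table of
# known character shapes and pick the closest one. The templates are invented
# for this example ("1" is an inked pixel, "0" is blank).
TEMPLATES = {
    "I": ["111", "010", "010", "010", "111"],
    "L": ["100", "100", "100", "100", "111"],
    "O": ["111", "101", "101", "101", "111"],
    "T": ["111", "010", "010", "010", "010"],
}


def similarity(glyph, template):
    """Fraction of pixels on which two equally sized bitmaps agree."""
    total = 0
    same = 0
    for glyph_row, template_row in zip(glyph, template):
        for g, t in zip(glyph_row, template_row):
            total += 1
            if g == t:
                same += 1
    return same / total


def recognize(glyph):
    """Return (best_character, score) for a single glyph bitmap."""
    scored = ((ch, similarity(glyph, tpl)) for ch, tpl in TEMPLATES.items())
    return max(scored, key=lambda pair: pair[1])


def recognize_line(glyphs):
    """Recognize a sequence of glyph bitmaps and join the results."""
    return "".join(recognize(glyph)[0] for glyph in glyphs)


# A scanned "O" with one pixel lost to noise still matches "O" best.
noisy_o = ["111", "101", "101", "101", "011"]
print(recognize(noisy_o)[0])
print(recognize_line([TEMPLATES["T"], TEMPLATES["O"], TEMPLATES["L"]]))
```

Adding a template for the digit zero would produce a shape nearly identical to the letter "O", which is the kind of ambiguity Levey describes and one reason the process is error-prone.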

Google has updated its system and commenced indexing. For more information and examples, check out Levey's original blog post.
