News

Google to index scanned PDF documents

November 06, 2008


Google has implemented technology that will allow it to index the full text of scanned PDF documents. In the past, such documents were rarely indexed at all.

While the search giant has provided full-text indexing of PDF documents for some time, scanned documents posed special problems. PDFs come in various flavors, including text only, image plus text, and image only. The first two typically result when a PDF is generated directly from an electronic source such as a Word document. Because they already include text, they are relatively easy to index.

By contrast, image-only PDFs are typically created by scanning paper documents. Computers cannot directly read the text in such documents: while the resulting PDFs look like the printed originals, they are in fact flat images without any textual content.

As Evin Levey, Google Product Manager, put it in the original blog post:

To people reading these documents, the distinction between words and pictures of words makes little difference, but for a computer the picture is almost unintelligible. Consider a circle. Should it be read as a zero, the letter "O", just a circle, or the ring from my coffee cup? People learn to answer this kind of question very quickly, but for the computer it is a painstaking and error-prone process.

To index the document's text, Optical Character Recognition (OCR) must be performed. OCR is the process of comparing the shapes in a scanned image with a database of known characters to determine which shapes represent text, and which characters they are. Once complete, the recognized text can be indexed like that of any other document.
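To make the idea of matching shapes against a character database concrete, here is a minimal Python sketch of template matching on tiny 3x5 bitmaps. It is a toy illustration only, not Google's method: the `TEMPLATES` table, the `distance`, `recognize`, and `read_line` helpers, and the sample glyphs are all hypothetical, and real OCR engines must also handle segmentation, fonts, skew, and noise.

```python
# Toy template-matching OCR. Each glyph is a 3x5 bitmap ('#' = ink, '.' = blank).
# TEMPLATES stands in for the "database of known characters" (hypothetical data).
TEMPLATES = {
    "0": ["###", "#.#", "#.#", "#.#", "###"],
    "1": [".#.", "##.", ".#.", ".#.", "###"],
    "7": ["###", "..#", ".#.", ".#.", ".#."],
    "L": ["#..", "#..", "#..", "#..", "###"],
}


def distance(a, b):
    """Count the pixels at which two equally sized bitmaps differ."""
    return sum(
        pa != pb
        for row_a, row_b in zip(a, b)
        for pa, pb in zip(row_a, row_b)
    )


def recognize(glyph):
    """Return (character, distance) for the template closest to the glyph."""
    best = min(TEMPLATES, key=lambda ch: distance(glyph, TEMPLATES[ch]))
    return best, distance(glyph, TEMPLATES[best])


def read_line(glyphs):
    """Recognize a sequence of pre-segmented glyphs as a string."""
    return "".join(recognize(g)[0] for g in glyphs)


# A "7" with one stray pixel in the bottom row, as a scanner might produce.
smudged_seven = ["###", "..#", ".#.", ".#.", "##."]
print(recognize(smudged_seven))  # ('7', 1)
```

Adding a template for the letter "O" with the same bitmap as "0" would produce a tie, which is the ambiguity Levey describes: shape alone cannot always decide which character was meant.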

Google has updated its system and commenced indexing. For more information and examples, check out Levey's original blog post.
