Google makes finding, converting PDF to HTML even easier
New search options include searching by specific file format

13 November 2001

Earlier this year, Google became the first major Internet search portal to offer full-text indexing and searchability of Acrobat PDF files. As part of its search results, it also provided a link to automated conversions -- to text -- of most PDF-based content, reportedly more than 22 million files.

Google Search by format

Recently the search engine expanded the number of non-standard file types it can index, further opening up a previously hidden trove of Internet-based content, commonly referred to as the "Invisible Web." The 12 main file types currently searched by Google (in addition to standard Web-formatted documents in HTML) are:

  • Adobe Portable Document Format (pdf)
  • Adobe PostScript (ps)
  • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
  • Lotus WordPro (lwp)
  • MacWrite (mw)
  • Microsoft Excel (xls)
  • Microsoft PowerPoint (ppt)
  • Microsoft Word (doc)
  • Microsoft Works (wks, wps, wdb)
  • Microsoft Write (wri)
  • Rich Text Format (rtf)
  • Text (ans, txt)

Among the enhanced features is the ability to easily and automatically search by specific filetype (shown in the drop-down box above) -- for example, searching only for PDF files, based on other user-supplied keywords, descriptive information or available search options.
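
For readers who would rather build such a search by hand than use the drop-down box, the same restriction can be typed into the query itself with Google's `filetype:` operator. The short Python sketch below shows one way to assemble that kind of search URL. The function name `build_filetype_query` and the sample keywords are hypothetical, chosen only for illustration, and the sketch does nothing more than construct a URL; it does not contact Google.

```python
from urllib.parse import urlencode


def build_filetype_query(keywords: str, extension: str) -> str:
    """Return a Google search URL limited to one file extension.

    Hypothetical helper for illustration; it relies on Google's
    `filetype:` search operator and only builds the URL string.
    """
    # Normalize input such as ".PDF" or " pdf " to a bare "pdf".
    ext = extension.strip().lstrip(".").lower()
    if not ext.isalnum():
        raise ValueError(f"unexpected file extension: {extension!r}")

    # Append the operator to the user's own keywords.
    query = f"{keywords.strip()} filetype:{ext}"

    # urlencode escapes spaces, colons and other reserved characters.
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # Sample keywords are an assumption, not taken from Google's interface.
    print(build_filetype_query("picasso teacher guide", "pdf"))
```

Any extension from the list above (for example `ps`, `xls` or `rtf`) can be passed in place of `pdf`.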

Google search result

In Google's search results report, the link to a plain text version of a PDF file's content has been replaced with a link to "View as HTML," an optional method for viewing Google's own "bare bones" version of content in non-native formats. Storing a copy on its own servers is intended to guard against possible virus or worm infections in the original documents.

So how good is Google's automated PDF-to-HTML conversion?

We didn't take the time to conduct a thorough set of tests with a variety of documents, but we did choose one PDF for examination -- a teacher's guide to Picasso: The Early Years [PDF: 276kb] from the National Gallery of Art in Washington, D.C. -- that can pose a challenge for automated conversions: the text is formatted into three vertical columns per page.

Results: Regarding the technical quality of the conversion itself, fairly impressive. It followed the text across the multiple columns very well, and the results appear to contain few mistakes. It maintained active links in the file, as well as italic and bold text styles. [NOTE -- Google converts only the text, not any graphics or formatting.]

You can do more with commercial conversion tools, but for a quick-and-dirty conversion of a PDF to HTML, it's worth a try. The trick, of course, is to get your PDF document indexed by Google's search spider so it appears in its listings.
