Google makes finding, converting PDF to HTML even easier
New search options include searching by specific file format

13 November 2001

Earlier this year, Google became the first major Internet search portal to offer full-text indexing and searchability of Acrobat PDF files. Its search results also included a link to an automated text conversion of most PDF-based content, reportedly more than 22 million files.

Google Search by format

Recently the search engine expanded the number of non-standard file types it can index, further opening up a previously hidden trove of Internet-based content, commonly referred to as the "Invisible Web." The 12 main file types currently searched by Google (in addition to standard Web-formatted documents in HTML) are:

  • Adobe Portable Document Format (pdf)
  • Adobe PostScript (ps)
  • Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
  • Lotus WordPro (lwp)
  • MacWrite (mw)
  • Microsoft Excel (xls)
  • Microsoft PowerPoint (ppt)
  • Microsoft Word (doc)
  • Microsoft Works (wks, wps, wdb)
  • Microsoft Write (wri)
  • Rich Text Format (rtf)
  • Text (ans, txt)
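
As a rough illustration, the list above can be written as a lookup table that maps a file extension back to the format it is listed under. This is only a sketch transcribed from the list; the names FORMATS and formats_for_extension are made up for this example and say nothing about how Google identifies file types internally. Note that "wks" is listed under both Lotus 1-2-3 and Microsoft Works, so a lookup can return more than one format.

```python
# Extension table transcribed from the list above.
# FORMATS and formats_for_extension are hypothetical names for this sketch.
FORMATS = {
    "Adobe Portable Document Format": ["pdf"],
    "Adobe PostScript": ["ps"],
    "Lotus 1-2-3": ["wk1", "wk2", "wk3", "wk4", "wk5", "wki", "wks", "wku"],
    "Lotus WordPro": ["lwp"],
    "MacWrite": ["mw"],
    "Microsoft Excel": ["xls"],
    "Microsoft PowerPoint": ["ppt"],
    "Microsoft Word": ["doc"],
    "Microsoft Works": ["wks", "wps", "wdb"],
    "Microsoft Write": ["wri"],
    "Rich Text Format": ["rtf"],
    "Text": ["ans", "txt"],
}


def formats_for_extension(ext):
    """Return every listed format that uses the given extension."""
    ext = ext.lower().lstrip(".")  # accept "PDF", ".pdf", "pdf"
    return [name for name, exts in FORMATS.items() if ext in exts]


if __name__ == "__main__":
    print(formats_for_extension(".pdf"))
    print(formats_for_extension("wks"))  # listed under two formats
```

An extension alone is therefore not always enough to tell two of the listed formats apart.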

Among the enhanced features is the ability to easily and automatically search by specific file type (shown in the drop-down box above) -- for example, searching only for PDF files that match user-supplied keywords, descriptive information or other available search options.
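
The same restriction can also be typed directly into the search box. The sketch below assumes Google's "filetype:" query operator, which this article does not mention by name, and the general shape of a Google search URL; the function name build_filetype_query is made up for this example. It only builds the URL and does not send any request.

```python
from urllib.parse import urlencode


def build_filetype_query(keywords, extension):
    """Build a search URL limited to one file type.

    Assumes the "filetype:" operator and the /search?q=... URL shape;
    neither is described in the article itself.
    """
    ext = extension.lower().lstrip(".")
    query = f"{keywords} filetype:{ext}"
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    # Keywords loosely based on the Picasso teacher's guide discussed below.
    print(build_filetype_query("picasso early years teacher guide", "pdf"))
```

Pasting the printed URL into a browser should have the same effect as choosing the file type in the drop-down box, if the operator behaves as assumed.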

Google search result

In Google's search results report, the link to a plain text version of a PDF file's content has been replaced with a link to "View as HTML," an optional method for viewing Google's own "bare bones" version of content in non-native formats. Storing a copy on its own servers is intended to guard against possible virus or worm infections in the original documents.

So how good is Google's automated PDF-to-HTML conversion?

We didn't take the time to conduct a thorough set of tests with a variety of documents, but we did choose one PDF for examination -- a teacher's guide to Picasso: The Early Years [PDF: 276kb] from the National Gallery of Art in Washington, D.C. -- that can pose a challenge for automated conversions: the text is formatted into three vertical columns per page.

Results: The technical quality of the conversion itself is fairly impressive. It followed the text across the multiple columns very well, and the results contain few mistakes. It maintained active links in the file, as well as italic and bold text styles. [NOTE -- Google converts only the text, not any graphics or formatting.]

You can do more with commercial conversion tools, but for a quick-and-dirty conversion of a PDF to HTML, it's worth a try. The trick, of course, is to get your PDF document indexed by Google's search spider so it appears in its listings.

