Google makes finding, converting PDF to HTML even easier
New search options include searching by specific file format
13 November 2001
Earlier this year, Google became the first major Internet search portal to offer full-text indexing and searchability of Acrobat PDF files. It also provided as part of its search results a link to automated conversions -- to text -- of most PDF-based content, reportedly more than 22 million files.
Recently the search engine expanded the number of non-standard file types it can index, further opening up a previously hidden trove of Internet-based content, commonly referred to as the "Invisible Web." The 12 main file types currently searched by Google (in addition to standard Web-formatted documents in HTML) are:
- Adobe Portable Document Format (pdf)
- Adobe PostScript (ps)
- Lotus 1-2-3 (wk1, wk2, wk3, wk4, wk5, wki, wks, wku)
- Lotus WordPro (lwp)
- MacWrite (mw)
- Microsoft Excel (xls)
- Microsoft PowerPoint (ppt)
- Microsoft Word (doc)
- Microsoft Works (wks, wps, wdb)
- Microsoft Write (wri)
- Rich Text Format (rtf)
- Text (ans, txt)
Among the enhanced features is the ability to restrict a search to a specific file type (shown in the drop-down box above) -- for example, searching only for PDF files that match user-supplied keywords, descriptive information or other available search options.
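For readers who would rather script such a search than pick from a menu, here is a minimal sketch in Python. It assumes Google's `filetype:` query operator and the `https://www.google.com/search?q=` URL form, neither of which is described in this article. The function names are hypothetical, and the extension set is simply copied from the list above.

```python
from urllib.parse import urlencode

# Extensions copied from the article's list of indexed formats.
INDEXED_EXTENSIONS = {
    "pdf", "ps",
    "wk1", "wk2", "wk3", "wk4", "wk5", "wki", "wks", "wku",
    "lwp", "mw", "xls", "ppt", "doc", "wps", "wdb", "wri", "rtf",
    "ans", "txt",
}


def build_filetype_query(keywords, extension):
    """Return a query string limited to one file type (hypothetical helper)."""
    ext = extension.lower().lstrip(".")
    if ext not in INDEXED_EXTENSIONS:
        raise ValueError(f"extension not in the article's list: {extension!r}")
    # Assumption: the 'filetype:' operator restricts results to that extension.
    return f"{keywords} filetype:{ext}"


def build_search_url(keywords, extension):
    """Return a search URL for the query (assumed URL form, not from the article)."""
    query = build_filetype_query(keywords, extension)
    return "https://www.google.com/search?" + urlencode({"q": query})


if __name__ == "__main__":
    print(build_search_url("picasso early years", "pdf"))
```

Checking the extension against the list first means a typo such as "pfd" fails immediately rather than quietly returning an empty result page.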
In Google's search results report, the link to a plain text version of a PDF file's content has been replaced with a link to "View as HTML," an optional method for viewing Google's own "bare bones" version of content in non-native formats. Storing a copy on its own servers is intended to guard against possible virus or worm infections in the original documents.
So how good is Google's automated PDF-to-HTML conversion?
We didn't take the time to conduct a thorough set of tests with a variety of documents, but we did choose one PDF for examination -- a teacher's guide to Picasso: The Early Years [PDF: 276kb] from the National Gallery of Art in Washington, D.C. -- that can pose a challenge for automated conversions: the text is formatted into three vertical columns per page.
Results: Regarding the technical quality of the conversion itself, fairly impressive. It followed the text across the multiple columns very well, and the results contain few mistakes. It maintained active links in the file, as well as italic and bold text styles. [NOTE -- Google converts only the text, not any graphics or formatting.]
You can do more with commercial conversion tools, but for a quick-and-dirty conversion of a PDF to HTML, it's worth a try. The trick, of course, is to get your PDF document indexed by Google's search spider so it appears in its listings.