Google now includes automatic PDF-to-text conversion
Enhancement allows PDF-based content to be indexed and searched
5 February 2001
Users of the Adobe Acrobat software and portable document format (PDF) files have known for some time that most major Internet search engines are oblivious to the information contained in PDF files. In short, because PDF documents have not been indexed, their contents have not been part of what most engines could search.
A number of PDF-aware indexing tools and products have become available in recent years, but for the most part, the major online search sites have continued to all but ignore information stored inside PDFs. The result: A large and growing amount of good information has remained hidden from view (while, conversely, a lot of old, useless information in other formats remains visible, further lowering the overall quality of search results).
Google Does PDF!
No matter which criteria you use to rank the top Internet search sites, there's no argument today that Google (www.google.com) -- with more than 13 million files indexed -- has become one of the best. In addition to its relevant matches of up-to-date content and tailored descriptions, Google's search results offer a "cached" version of most pages -- its "spider" has crawled the Internet and maintained an archive of pages ("snapshots," Google calls them) that in some cases no longer exist on the Web. The search query term is also highlighted in the results.
The latest of Google's efforts to bring more of the so-called "Invisible Web" into view: PDF files are now converted to text, making their content indexable and searchable. Each result links both to the original PDF document (marked by the term [pdf] in blue text) and to Google's text version. Currently the results list each PDF document's title only as "Untitled," but further updates are expected to correct this and to add other new capabilities.
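For readers curious why converting PDFs to text makes them searchable, the idea can be sketched in a few lines of Python. This is only an illustration, not a description of how Google's system works: it assumes the text has already been extracted from each PDF, and the file names, sample text, and function names (`tokenize`, `build_index`, `search`, `highlight`) are all hypothetical.

```python
import re
from collections import defaultdict

TOKEN = re.compile(r"[a-z0-9]+")


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return TOKEN.findall(text.lower())


def build_index(docs):
    """Map each term to the set of document names that contain it."""
    index = defaultdict(set)
    for name, text in docs.items():
        for term in tokenize(text):
            index[term].add(name)
    return index


def search(index, query):
    """Return the sorted names of documents containing every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    hits = set.intersection(*(index.get(term, set()) for term in terms))
    return sorted(hits)


def highlight(text, query):
    """Wrap whole-word matches of each query term in asterisks."""
    for term in set(tokenize(query)):
        pattern = rf"\b({re.escape(term)})\b"
        text = re.sub(pattern, r"*\1*", text, flags=re.IGNORECASE)
    return text


# Hypothetical text, assumed to have been extracted from two PDF files.
docs = {
    "report.pdf": "Annual report on search engines and PDF files",
    "notes.pdf": "Meeting notes about indexing",
}
index = build_index(docs)

for name in search(index, "pdf search"):
    print(name, "->", highlight(docs[name], "pdf search"))
```

Once the text exists, a PDF is no different from any other page: its words go into the same term-to-document table, and the same lookup finds it. The highlighting step mirrors the way query terms are marked in the results.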
Google reportedly will have the first release of its PDF search functionality installed and operating on all 7,000 of its file servers by February 5.