Google Docs might be best known as a cloud-based tool for word processing, spreadsheets and presentations, but it is also becoming quite useful for working with PDF files.
Not only does Google Docs include a Download as PDF option, which lets you save your text documents, spreadsheets and presentations as PDF, but it also lets you upload PDF files to the Google Docs file management system so that you can organize and share them along with your other Google Docs files.
Recently Google has also introduced an option to convert scanned PDFs (that is, PDF files created by scanning paper to an image and then converting that image to a PDF) to text-based documents via optical character recognition (OCR) technology when the PDF files are uploaded.
The OCR'ing process is presumably powered by Tesseract, an open source OCR engine which was originally developed at Hewlett-Packard starting in 1985, was released as open source in 2005, and which Google has been sponsoring since 2006.
It is the conversion of scanned PDFs to text documents that this article will take a closer look at.
How to convert scanned PDFs to text documents
To have your scanned PDF files converted to text documents, open your Google Docs account and click on the icon highlighted below. Then click on Files, select the files you want to upload and click on the Open button.
Figure 1. Upload file(s)
The window shown below will now pop up. Make sure that the "Convert text from PDF and image files to Google documents" option is selected. This is the option that converts your scanned PDF files to text documents. Click on Start upload.
Figure 2. Upload settings
The uploading and OCR'ing process will now begin. The window shown below should appear at the bottom right of your screen and will show you the status of the upload.
Figure 3. Upload status
Once Google Docs has finished doing its thing, the newly minted text document will be shown in the file list. Click on it to open it and take a look.
Converting a text-based PDF (that is, a PDF which uses text objects and can also include image objects and other object types) to a Microsoft Word document is notoriously hard. So you can imagine that converting a scanned PDF, which does not include any actual text objects, to a Google Document is going to be at least as hard, if not harder, because of the OCR'ing required.
I found that Google Docs was quite good at OCR'ing the scanned PDF and copying the text into the text document, but that it had a hard time maintaining the look and feel (read: layout and formatting) of the original scanned PDF. This isn't surprising. As I mentioned before, it's notoriously hard to do this, and people currently pay a lot of money for inferior solutions.
If you use Google Docs as a tool to get text from scanned PDF files so that it can be copied and pasted elsewhere then I think you'll find it very useful, but if you're looking to replicate a scanned PDF as a text document with the exact same look and feel, it will still require quite a bit of manual tinkering to get the final output looking right.
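If copying and pasting the text is your main goal, part of that manual tinkering can be scripted. The snippet below is only a rough sketch in Python, not anything Google Docs provides: it assumes you have already copied the converted text out of the document as plain text, and the function name tidy_ocr_text is made up for this example. It rejoins words that were hyphenated across line breaks and unwraps lines that were broken at the edge of the scanned page.

```python
import re


def tidy_ocr_text(raw: str) -> str:
    """Rejoin hyphenated line breaks and unwrap hard-wrapped lines.

    Hypothetical helper for illustration; it only handles plain text.
    """
    # Normalize Windows and old Mac line endings to "\n".
    text = raw.replace("\r\n", "\n").replace("\r", "\n")
    # Rejoin words split by a hyphen at the end of a line:
    # "con-\nvert" becomes "convert".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Treat one or more blank lines as a paragraph break.
    paragraphs = re.split(r"\n\s*\n", text)
    # Within each paragraph, collapse line breaks and runs of spaces.
    cleaned = [" ".join(p.split()) for p in paragraphs]
    # Drop empty paragraphs and separate the rest with one blank line.
    return "\n\n".join(p for p in cleaned if p)


if __name__ == "__main__":
    sample = (
        "Google Docs can con-\nvert scanned PDF\nfiles to text.\n\n\n"
        "Second   para-\ngraph here.\n"
    )
    print(tidy_ocr_text(sample))
```

Keep in mind that this simple rule also glues together genuinely hyphenated words that happen to fall at the end of a line (so "text-" followed by "based" on the next line would come out as "textbased"), so the result still deserves a quick read-through.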