ScanSoft Inc.'s TextBridge 9.0 Business Edition (BE)
A Paper-to-PDF Alternative to Adobe Acrobat Capture?
By Duff Johnson, Planet PDF Contributing Editor
Adobe Systems introduced paper-to-PDF conversion four and a half years ago with Acrobat Capture. Although Adobe was never an Optical Character Recognition (OCR) shop, its first OCR product went places that, arguably, it took Adobe to think of.
Along with the revolutionary "Image + Text" concept, Adobe introduced Acrobat Capture Reviewer, a tool for high-end paper-to-electronic conversion that includes font management, kerning, and line and page layout formatting. Its output file type, which Adobe cursed with the name "PDF/ Normal", is so called to emphasize the similarity with "real" electronic-source PDF files. Adobe's goal was, in effect, to reverse the printing process - to make digital documents from paper.
OCR developers, on the other hand, had tended to focus solely on text recognition, with image treatment or page layout as secondary - or irrelevant - distractions from the holy grail of text accuracy. Since the release of Acrobat Capture 2.01 in Q2 1998, most major OCR developers have sought to add PDF output to their products, with varying degrees of success. ScanSoft's Q3 1999 release of TextBridge Pro 9.0 Business Edition reveals important progress with PDF output, and highlights the many challenges facing developers who want to produce serious alternatives to Capture. A side note: my money is on the first vendor that ports its engine to Linux, regardless.
Features & Issues
Accuracy is always the sine qua non of OCR engines, and ScanSoft's latest engine is superb in this area. We obtained excellent results on virtually all image types and across a wide range of resolutions. Fax-quality images gave decent results, and we scored astounding accuracy on carefully scanned 300-dpi book-page images. The OCR is trainable, and pages may be zoned manually, which itself can mean the difference between 99.5 percent and 99.9 percent accuracy -- in other words, a big difference.
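To see why a few tenths of a percent matter, here is a back-of-the-envelope calculation. The figure of 2,000 characters per page and the 300-page book are assumptions chosen for illustration, not numbers from our tests.

```python
# Rough estimate of the character errors left for a proofreader to fix.
# CHARS_PER_PAGE is an assumed, illustrative figure, not a measured one.
CHARS_PER_PAGE = 2000


def expected_errors(accuracy_percent, pages=1, chars_per_page=CHARS_PER_PAGE):
    """Return the expected number of misrecognized characters."""
    error_rate = (100.0 - accuracy_percent) / 100.0
    return error_rate * chars_per_page * pages


for accuracy in (99.5, 99.9):
    per_page = expected_errors(accuracy)
    per_book = expected_errors(accuracy, pages=300)  # assumed book length
    print(f"{accuracy}% accurate: about {per_page:.0f} errors per page, "
          f"{per_book:.0f} per 300-page book")
```

Under that assumption, 99.5 percent leaves about ten errors on every page and 99.9 percent about two, a fivefold difference in correction work.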
Zoning is a valuable tool in OCR, and TextBridge 9.0 BE makes evident how critical operator-enhanced zoning is for PDF conversion. TB's excellent table zone tool, together with its support for image and text zones on output, has the potential to provide real competition for Acrobat Capture Reviewer, currently the industry's one and only shrink-wrapped tool for correcting not only OCRed text but also formatting, image handling, page layouts and vector graphics.
Capture's engine performs the OCR, then outputs intermediate ACD files, which may be converted directly to PDF or distributed to Acrobat Capture Reviewer correction stations. TextBridge takes a workstation-based approach, in which all processing occurs at the workstation. In version 9.0, ScanSoft added a batch-handling utility, called Scheduler, to TextBridge. One can argue the relative merits of the centralized vs. distributed processing model, but either way, Capture retains the edge for batch handling.
While welcome, TextBridge's Scheduler is limited to processing one directory of files at a time, and batch settings cannot be saved (not that many options are offered to begin with). Engine reliability is an unsung key feature of OCR engines with pretensions to volume processing, and TextBridge does crash significantly more often, on average, than a healthy installation of Acrobat Capture 2.01. Neither application, however, is equipped with adequate logging and exception-handling functionality. In five trials, we could not get a single overnight run of several thousand pages to complete - TextBridge crashed within 250 pages of the start-line every time. Acrobat Capture beats that performance three out of five times.
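For readers wondering what "adequate logging and exception handling" might look like in a batch run, here is a minimal sketch. It does not describe how either product works: the `convert` callable is a hypothetical stand-in for whatever conversion step a workflow invokes, and the progress-file name is up to the caller.

```python
import logging
from pathlib import Path

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s %(levelname)s %(message)s")
log = logging.getLogger("batch")


def run_batch(files, convert, done_log):
    """Convert each file, logging failures and skipping files already done.

    `convert` is a hypothetical callable supplied by the caller; it stands in
    for a real OCR-to-PDF step and may raise an exception on bad input.
    """
    done_path = Path(done_log)
    done = set(done_path.read_text().splitlines()) if done_path.exists() else set()
    failed = []
    for name in files:
        if name in done:
            continue  # finished in an earlier run; do not redo it
        try:
            convert(name)
        except Exception:
            # Record the failure with a traceback and move on to the next file.
            log.exception("conversion failed for %s", name)
            failed.append(name)
            continue
        with done_path.open("a") as fh:
            fh.write(name + "\n")  # record progress so a restart can resume
    return failed
```

The progress file is the important part for overnight runs: if the whole process dies at page 250, restarting with the same file list picks up where it left off instead of starting over.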
The batch handling in Capture 2.01 is unambitious, but in a way, that is also its charm. Acrobat Capture allows the user to set and save (almost) complete conversion settings for each workflow - itself invaluable to users with more than one project!
The Output - Proof is in the Pudding
Like other OCR vendors, ScanSoft has not licensed Adobe's PDF libraries for use in its product, but instead uses its own encoding. This has both positive and negative aspects.
By not using Adobe's libraries, ScanSoft appears forced to make a familiar, and unfortunate, compromise for PDF/ Image+ Text: the image is occulted by the text. In order to display the PDF page correctly on screen, the text has to be "transparent" to allow viewing of the image. Depending on your printer, "transparent" can mean "white" - which is no good to anyone. A related problem is that screen refresh times are slower with TextBridge's Image+ Text files, although the files are generally quite a bit smaller than those generated with Capture.
Using Adobe's own code, Acrobat Capture still produces the "reference standard" in Image+ Text files, because the text is truly "in back of" the image, assuring the same appearance both on-screen and on the printed page. Of course, only Capture-generated Image+ Text files include, at no extra charge, the annoying "thin white horizontal lines" phenomenon - sometimes visible on-screen, but gone in print.
On the plus side is ScanSoft's decision to normalize fonts to the Acrobat 3.x standard base 14 Type 1 fonts, improving performance and reducing "file bloat" in smaller documents. Unsophisticated conversions with Capture generally result in up to 30% of total file size being given over to font data needlessly embedded in each and every PDF! TextBridge-generated files are cleaner from the ground up, and superior to Acrobat Capture's in this regard.
In PDF/ Normal files, OCR "suspects" may be left on the PDF page as bitmap images to "conceal" the OCR error, if any. We found the alignment of this word-suspect image with the surrounding text data to be quite unsuccessful on most of the pages we tried with TextBridge. If suspect correction is a part of your PDF/ Normal process, however, you may not care about this, since you won't have any suspects anyway.
The other major flaw in TextBridge's PDF/ Normal output is in image-handling. Regardless of input, TB delivers an image zone in 8-bit grayscale (or color), at 100 dpi. Feed it a 300 dpi CCITT G4 image, get 100 dpi grayscale in return (with attendant chunky file size). Without getting into the image readability debate, let's just say that the operator ought to be able to select what happens to their input image! With this enhancement, TextBridge 9.0 BE would be a credible - and fast - PDF/ Normal generator for many types of documents, with nary a dongle click.
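The arithmetic behind that "chunky file size" can be sketched as follows. The 8.5 x 11 inch page is an assumption for illustration, and the figures are raw sizes before any compression is applied.

```python
# Uncompressed size of a scanned page at a given resolution and bit depth.
# The 8.5 x 11 inch page is an assumption chosen for illustration.
def raw_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Return the uncompressed image size in bytes."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


PAGE = (8.5, 11.0)  # assumed US Letter page, in inches

bitonal_300 = raw_bytes(*PAGE, dpi=300, bits_per_pixel=1)
gray_100 = raw_bytes(*PAGE, dpi=100, bits_per_pixel=8)

print(f"300 dpi, 1-bit: {bitonal_300:,} bytes before compression")
print(f"100 dpi, 8-bit: {gray_100:,} bytes before compression")
print(f"Pixels kept after downsampling: {100**2 / 300**2:.0%}")
```

Before compression the two come out at a similar size, yet the grayscale version keeps only one ninth of the pixels. Since CCITT G4 typically shrinks bitonal text pages very effectively, the 300 dpi original is usually the smaller file as well as the sharper one.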
The main question for TextBridge is whether the format issues in Image+ Text files can be addressed. Licensing the Adobe libraries may be part of the price for a truly PDF-enabled OCR engine - a strategic decision, I guess. The success of PDF was eventually going to mean competition in the paper-to-PDF conversion software business - which Adobe invented almost as an afterthought to PDF itself. That competition has almost arrived.
Strengths & weaknesses comparison
| Product | Strengths | Weaknesses |
| ScanSoft TextBridge 9.0 BE | Zoning & the Table tool; OCR accuracy - whoa! | "Sub-standard" PDF/ Image+ Text; image handling in PDF/ Normal; no undo functionality |
| Adobe Acrobat Capture 2.01 | "Reference Standard" PDF/ Image+ Text files; Acrobat Capture Reviewer - still the only application generating high-quality PDF/ Normal files "out of the box" | Lack of zoning |
Samples of Adobe Acrobat Capture output formats:
TextBridge Pro 9.0 Business Edition - System Requirements
Windows 95, 98, 2000 or Windows NT 4.0
Internet Explorer 4.0 (or later) or Netscape Navigator 5.0 (or later) is necessary for viewing formatted (WYSIWYG) HTML output.
Intel (or compatible) 486 or higher
24MB of RAM (32MB recommended)
Hard Drive Space required:
Scanner/Input Device Drivers: