REVIEW:
ScanSoft Inc.'s TextBridge
9.0 Business Edition (BE)

A paper-to-PDF Alternative
to Adobe Acrobat Capture?

By Duff Johnson, Planet PDF Contributing Editor

Adobe Systems introduced paper-to-PDF conversion four and a half years ago with Acrobat Capture. Although Adobe was never an Optical Character Recognition (OCR) shop, its first OCR product went places that, arguably, it took Adobe to think of.

Along with the revolutionary "Image + Text" concept, Adobe introduced Acrobat Capture Reviewer, a tool for high-end paper-to-electronic conversion that includes font management, kerning, and line and page layout formatting. Its output file type, which Adobe cursed with the name "PDF/ Normal", was so named to emphasize the similarity with "real" electronic-source PDF files. Adobe's goal was, in effect, to reverse the printing process - to make digital documents from paper.

OCR developers, on the other hand, have tended to focus solely on text recognition, treating image handling and page layout as secondary - or irrelevant - distractions from the holy grail of text accuracy. Since the release of Acrobat Capture 2.01 in Q2 1998, most major OCR developers have sought to add PDF output to their products, with varying degrees of success. ScanSoft's Q3 1999 release of TextBridge Pro 9.0 Business Edition reveals important progress with PDF output, and highlights the many challenges for developers working to produce serious alternatives to Capture. A side note: my money is on the first vendor who ports their engine to Linux, regardless.

Features & Issues

OCR accuracy
Accuracy is always the sine qua non of OCR engines, and ScanSoft's latest engine is superb in this area. We generated excellent results on virtually all image types, and in a wide range of resolutions. Fax-quality images gave decent results, and we scored astounding accuracy on carefully scanned 300-dpi book-page images. The OCR is trainable and may be zoned manually, which itself can mean the difference between 99.5 percent and 99.9 percent accuracy -- in other words, a fivefold drop in the error rate, which is a big difference.
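To put those two accuracy figures in perspective, here is a minimal sketch of the arithmetic. The 2,000 characters per page is an assumed figure for a dense text page, not a number from our tests, and `expected_errors` is a name made up for this illustration.

```python
def expected_errors(chars_per_page: int, accuracy_percent: float) -> float:
    """Expected number of misrecognized characters on one page."""
    return chars_per_page * (100.0 - accuracy_percent) / 100.0


# Assumption: a dense text page holds roughly 2,000 characters.
CHARS_PER_PAGE = 2000

for accuracy in (99.5, 99.9):
    errors = expected_errors(CHARS_PER_PAGE, accuracy)
    print(f"{accuracy}% accuracy: about {errors:.0f} errors per page")
```

Under that assumption, the operator is correcting about ten errors per page at 99.5 percent and about two at 99.9 percent, which adds up quickly over a job of several thousand pages.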

Zoning
Zoning is a valuable tool in OCR, and TextBridge 9.0 BE makes evident how critical operator-enhanced zoning is for PDF conversion. TB's excellent table zone tool, together with its support for image and text zones on output, has the potential to provide real competition for Acrobat Capture Reviewer, currently the industry's one and only shrink-wrapped tool for correcting not only OCRed text but also formatting, image handling, page layouts and vector graphics.

Batch Handling
Capture's engine performs the OCR, then outputs intermediate ACD files, which may be converted directly to PDF or distributed to Acrobat Capture Reviewer correction stations. TextBridge takes a workstation-based approach in which all processing occurs at the workstation. In version 9.0, ScanSoft added a batch-handling utility, called Scheduler, to TextBridge. One can argue the relative merits of the centralized vs. distributed processing model, but either way, Capture retains the edge for batch handling.

While welcome, TextBridge's Scheduler is limited to processing one directory of files at a time, and batch settings cannot be saved (not that many options are offered to begin with). Engine reliability is an unsung key feature of OCR engines with pretensions to volume processing, and TextBridge crashes significantly more often, on average, than a healthy installation of Acrobat Capture 2.01. Neither application, however, is equipped with adequate logging and exception-handling functionality. In five trials, we could not get a single overnight run of several thousand pages to complete - TextBridge crashed within 250 pages of the start every time. Acrobat Capture beats that performance three times out of five.
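For readers wondering what "adequate logging and exception handling" might look like in a batch run, here is a rough sketch. It is not how either product works. `run_batch` and `fake_convert` are hypothetical names, and the converter is a stand-in rather than a real OCR call.

```python
import logging
from typing import Callable, Iterable, List, Tuple


def run_batch(
    pages: Iterable[str],
    convert: Callable[[str], None],
    log: logging.Logger,
) -> Tuple[List[str], List[str]]:
    """Convert each page, log the outcome, and keep going after a failure."""
    done: List[str] = []
    failed: List[str] = []
    for page in pages:
        try:
            convert(page)
        except Exception:
            # Record the traceback and move on instead of ending the run.
            log.exception("conversion failed: %s", page)
            failed.append(page)
        else:
            log.info("converted: %s", page)
            done.append(page)
    return done, failed


def fake_convert(page: str) -> None:
    """Hypothetical stand-in for an OCR call; fails on one page on purpose."""
    if page == "page-002.tif":
        raise RuntimeError("simulated engine failure")


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO,
                        format="%(asctime)s %(levelname)s %(message)s")
    pages = ["page-001.tif", "page-002.tif", "page-003.tif"]
    done, failed = run_batch(pages, fake_convert, logging.getLogger("batch"))
    print(f"{len(done)} converted, {len(failed)} failed")
```

A loop like this only survives errors the converter reports as exceptions. A hard crash of the engine itself would still end the run unless each conversion were launched in its own process, which this sketch does not attempt.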

The batch handling in Capture 2.01 is unambitious, but in a way, that is also its charm. Acrobat Capture allows the user to set and save (almost) complete conversion settings for each workflow - itself invaluable to users with more than one project!

The Output - Proof is in the Pudding

Like other OCR vendors, ScanSoft has not licensed Adobe's PDF libraries for use in its product, but instead uses its own encoding. This has both positive and negative aspects.

By not using Adobe's libraries, ScanSoft appears forced to make a familiar, and unfortunate, compromise for PDF/ Image+ Text: the image is covered by the text. In order to display the PDF page correctly on screen, the text has to be "transparent" to allow viewing of the image. Depending on your printer, "transparent" can mean "white" - which is no good to anyone. A related problem: screen refresh times are slower with TextBridge's Image+ Text files, although the files are generally quite a bit smaller than those generated with Capture.

Using Adobe's own code, Acrobat Capture still produces the "reference standard" in Image+ Text files, because the text is truly "in back of" the image, assuring the same appearance both on-screen and on the printed page. Of course, only Capture-generated Image+ Text files include, at no extra charge, the annoying "thin white horizontal lines" phenomenon - sometimes visible on-screen, but gone in print.

On the plus side is ScanSoft's decision to generate fonts normalized to the Acrobat 3.x standard base 14 Type 1 fonts, improving performance and reducing "file bloat" in smaller documents. Unsophisticated conversions with Capture generally result in up to 30% of total file size being given over to font data needlessly embedded in each and every PDF! TextBridge-generated files are cleaner from the ground up, and superior to Acrobat Capture's in this regard.

In PDF/ Normal files, OCR "suspects" may be left on the PDF page as bitmap images to "conceal" the OCR error, if any. We found the alignment of this word-suspect image with the text data quite unsuccessful on most of the pages we tried with TextBridge. If suspect correction is a part of your PDF/ Normal process, however, you may not care about this, since you won't have any suspects anyway.

The other major flaw in TextBridge's PDF/ Normal output is in image-handling. Regardless of input, TB delivers an image zone in 8-bit grayscale (or color), at 100 dpi. Feed it a 300 dpi CCITT G4 image, get 100 dpi grayscale in return (with attendant chunky file size). Without getting into the image readability debate, let's just say that the operator ought to be able to select what happens to their input image! With this enhancement, TextBridge 9.0 BE would be a credible - and fast - PDF/ Normal generator for many types of documents, with nary a dongle click.
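To see what that conversion does to the image data, here is a back-of-the-envelope sketch of uncompressed sizes. It assumes a full US Letter page (8.5 x 11 inches) rather than a single image zone, and it deliberately ignores compression, so the byte counts describe raw pixel data only and not the size of the resulting PDF. `raw_bytes` is a name made up for this illustration.

```python
def raw_bytes(width_in: float, height_in: float, dpi: int, bits_per_pixel: int) -> int:
    """Uncompressed size in bytes of a scanned image."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * bits_per_pixel // 8


# Assumption: a full US Letter page, in inches.
PAGE = (8.5, 11.0)

bitonal_300 = raw_bytes(*PAGE, dpi=300, bits_per_pixel=1)  # 1-bit input scan
gray_100 = raw_bytes(*PAGE, dpi=100, bits_per_pixel=8)     # 8-bit grayscale output

print(f"300 dpi, 1-bit:  {bitonal_300:,} bytes uncompressed")
print(f"100 dpi, 8-bit:  {gray_100:,} bytes uncompressed")
print(f"pixel count ratio: {(300 / 100) ** 2:.0f} to 1")
```

Under these assumptions the raw data comes to 1,051,875 bytes for the 300 dpi bitonal page and 935,000 bytes for the 100 dpi grayscale page. The output has one ninth of the pixels yet occupies nearly the same space before compression.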

Summary

The main question for TextBridge is, can the format issues in Image+ Text files be addressed? Licensing the Adobe libraries may be part of the price for a truly PDF-enabled OCR engine - a strategic decision, I guess. The success of PDF was eventually going to mean competition in the paper to PDF conversion software business - which Adobe invented almost as an afterthought to PDF itself. That competition has almost arrived.

Strengths & weaknesses comparison

ScanSoft TextBridge 9.0 BE

Key Strengths
  • Zoning & the Table tool
  • OCR accuracy - woah!
  • Cost

Key Weaknesses
  • "Sub-standard" PDF/ Image+ Text
  • Image handling in PDF/ Normal output
  • No undo functionality

Adobe Acrobat Capture 2.01

Key Strengths
  • Superior batch-handling
  • "Reference Standard" PDF/ Image+ Text files
  • Acrobat Capture Reviewer - still the only application generating high-quality PDF/ Normal files "out of the box"

Key Weaknesses
  • Cost
  • OCR accuracy
  • Lack of zoning

Samples of Adobe Acrobat Capture output formats: http://www.document-solutions.com/definition.html

TextBridge Pro 9.0 Business Edition - System Requirements

Software Requirements:
  • Windows 95, 98, 2000 or Windows NT 4.0
  • Internet Explorer 4.0 (or later) or Netscape Navigator 5.0 (or later) are necessary for viewing formatted (WYSIWYG) HTML output.

Hardware Requirements:
  • Intel (or compatible) 486 or higher

Memory Requirements:
  • 24MB of RAM (32MB recommended)

Hard Drive Space required:
  • 20MB

Scanner/Input Device Drivers:
  • TWAIN

MORE Info:
www.scansoft.com/products/tb90be/index.html

