Planet PDF Forum Archive



Topic: RE: OCR compatibility with tag tree/searching (Via Email)
Conf: (P-PDF) PDF Accessibility, Msg: 134329
From: Duff_Johnson
Date: 6/11/2005 02:07 AM

> I'm trying to create accessible, searchable PDFs from paper texts.
>
> I scan them as 300 dpi bitonal TIFFs and then perform OCR
> using OmniPage Pro 11. Then I save back as PDF with image on
> text. Retaining the image is important for these archival documents.
>
> The first problem I'm having is with the tag tree. In
> Acrobat 6.0 Professional, I use the Add Tags to Document
> option in the Accessibility portion of the Advanced menu.
> This invariably creates a Figure tag for the page image and a
> Table tag for the page of text. Each line (or portion
> thereof) of text is given a Paragraph tag. Also, the graphic
> zones that I've created in OmniPage are not recognized as
> Figures in Acrobat. This, to me, makes the tag tree
> relatively useless. But, the Web Accessibility people at my
> institution teach "an accessible PDF is a tagged PDF."

They are correct!

First of all, the fact that you are saving your OmniPage file to
PDF/Searchable Image (Image on Text) is the factor causing the
image-zones you defined in OmniPage to disappear. Creating image-zones
is only meaningful for PDF/Formatted Text and Graphics (PDF/Normal)
output, which won't work for you on account of your need to retain the
entire page-image.

I would remove the page-image from the structure tree, then retag the
text so it's not in a table (unless it should be, of course).
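The retagging described above can be sketched as a small tree transformation. This is only an illustration on a hypothetical in-memory model: the `Tag` class, the `is_page_image` flag and the `retag` function are assumptions made up for the example, not Acrobat's API. In practice the same steps are done by hand in Acrobat's Tags panel.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Tag:
    """Hypothetical stand-in for one structure-tree element."""
    kind: str                       # e.g. "Document", "Figure", "Table", "P"
    text: str = ""
    is_page_image: bool = False     # assumed flag marking the full-page scan
    children: List["Tag"] = field(default_factory=list)


TABLE_KINDS = {"Table", "TR", "TD", "TH"}


def retag(node: Tag, unwrap_tables: bool = True) -> List[Tag]:
    """Return what should replace `node`: [] drops it, its children unwrap it."""
    # Drop the Figure tag that wraps the scanned page image.
    if node.kind == "Figure" and node.is_page_image:
        return []
    new_children: List[Tag] = []
    for child in node.children:
        new_children.extend(retag(child, unwrap_tables))
    # Lift text out of a spurious table; pass unwrap_tables=False for real tables.
    if unwrap_tables and node.kind in TABLE_KINDS:
        return new_children
    return [Tag(node.kind, node.text, node.is_page_image, new_children)]


# Usage: the shape "Add Tags to Document" reportedly produces for an OCR page.
doc = Tag("Document", children=[
    Tag("Figure", is_page_image=True),
    Tag("Table", children=[
        Tag("TR", children=[
            Tag("TD", children=[Tag("P", "line one"), Tag("P", "line two")]),
        ]),
    ]),
])
(cleaned,) = retag(doc)
print([child.kind for child in cleaned.children])
```

The `unwrap_tables` switch is there because pages that really do contain tabular data should keep their Table tags.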

> Do you have any experience with this or know of any solution?
> Do other OCR programs work better with Acrobat?

Well... Acrobat Capture 3.05 has old OCR, but it has smoother
integration with the tagging process, although the process is not at all
refined, and has some serious gaps.

The thing to focus on is getting the best-possible OCR results (not
software), then address the tagging in Acrobat. If accessibility is a
priority for your scanned document, then correcting the OCR to (a) clean
out "debris" from graphics and (b) ensure all the words will read
correctly is a must.
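As a rough aid for step (a), a script can point a human proofreader at likely debris. The sketch below is an assumption-laden heuristic, not a feature of OmniPage or Acrobat: the function name `suspicious_tokens` and the 50% threshold are invented for illustration, and the output is a review list rather than an automatic fix.

```python
from typing import List


def suspicious_tokens(ocr_text: str, min_alnum_ratio: float = 0.5) -> List[str]:
    """Flag whitespace-separated tokens that are mostly punctuation or symbols.

    The threshold is an arbitrary illustrative default; tune it per collection.
    """
    flagged = []
    for token in ocr_text.split():
        alnum = sum(ch.isalnum() for ch in token)
        if alnum / len(token) < min_alnum_ratio:
            flagged.append(token)
    return flagged


# Usage: a line of OCR text with stray marks picked up from a graphic.
sample = "The quick ~~|'. brown fox, ;:_ jumped."
print(suspicious_tokens(sample))
```

A check like this only narrows down where to look. Misrecognized words that are made entirely of letters will pass straight through it, so proofreading is still needed.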

Having created a high-quality OCRed document (and this is NOT an
all-automation process), you can then tag it in Acrobat (Acrobat 7.0
Professional is better for this than 6.0 Pro).

> Also, some of my documents are math papers with formulas.
> The powers that be want the formulas to be expressed in
> natural language for screen readers as well as searchable by
> LaTeX. Any suggestions on where to hide all of this text?

To do that right, you'd need to drop the PDF/Searchable Image model,
because it simply does not allow multiple image zones on the page.
However, since I'm sure I can't talk you out of Searchable Image PDF,
I'm afraid you are left with "hiding" the text as "Actual Text" on
some convenient text-element (such as a period). This is ugly, but it
works. However, I would really suggest considering Formatted Text and
Graphics for these pages.
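For reference, at the PDF content-stream level replacement text of this kind can be carried as an ActualText entry on a marked-content sequence. The sketch below only builds that fragment as a string; the helper names `pdf_literal` and `actual_text_span` are hypothetical, the fragment would still need to sit inside a text object with a font selected, and in Acrobat you would normally just type the text into the tag's properties rather than edit the stream. How a given screen reader treats it is not guaranteed.

```python
def pdf_literal(text: str) -> str:
    """Escape ASCII text as a PDF literal string, e.g. (like this)."""
    if not text.isascii():
        # Non-ASCII text needs a different string encoding; out of scope here.
        raise ValueError("this sketch handles ASCII text only")
    escaped = (
        text.replace("\\", "\\\\")   # backslash first, so later escapes survive
            .replace("(", "\\(")
            .replace(")", "\\)")
    )
    return f"({escaped})"


def actual_text_span(visible: str, spoken: str) -> str:
    """Wrap a visible glyph run in a Span whose ActualText is `spoken`."""
    return (
        f"/Span <</ActualText {pdf_literal(spoken)}>> BDC "
        f"{pdf_literal(visible)} Tj EMC"
    )


# Usage: hang a natural-language reading of a formula on a period.
print(actual_text_span(".", "x squared (plus one)"))
```

The escaping matters for the use case in the question, since LaTeX source is full of backslashes that a PDF literal string would otherwise interpret.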

> My
> current (but untested) solution is to put the natural
> language in the OCR layer and the LaTeX in Bookmarks,
> although this seems less than ideal, and I'm not sure how it
> will work across platforms.

You mean... you are adding LaTeX code as bookmark text?!? Could you
explain this a bit further?

Duff Johnson
Document Solutions, Inc.
http://www.document-solutions.com

