Planet PDF Forum Archive





Topic: Extracting text from PDF problem
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 162943
From: Granty555
Date: 10/19/2007 12:50 AM

Hello all,

First post here, so please excuse me if I've posted in the wrong forum!

I have to extract the text from a large number (read: thousands) of PDFs into Excel so I can manipulate the data for later use. Some are scanned images, and I know I have to use OCR for those. However, the vast majority contain (as far as I know) proper text: I can open them in Adobe Reader, select some text, copy and paste it into Notepad, and it comes out fine.

My problem is that the content is table-based and very complicated (and unfortunately confidential, so I can't post an example), and I need to keep the format exactly as it is so that my parsing routines can correctly build the required data model. I have tried several text-extraction products, but each one fails in some way.

ABBYY FineReader 8.0 keeps the format beautifully, but does not recognise the text! It extracts some of it (e.g. table column headers, footnotes), but the data actually in the table is not recognised at all. Other text extractors seem to have no problem getting all the text but cannot reproduce the format in Excel. Able2Extract can extract the text, and the format isn't bad, but it seems to create double spacing between words and adds extra spaces if it encounters "..".
The text must be perfect, as the parsing routine uses a lot of string matching, and with tens of thousands of pages, manual correction (or even checking) is not acceptable.

I have spoken to ABBYY support, but they require a sample of the problem PDF and I am not authorised to release it.

Does anyone have any idea why ABBYY and Able2Extract (A2E) are having problems extracting the text? Or does anyone know of a product that can reliably extract such text while mirroring the table format perfectly in the Excel output?

Many thanks

J.
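
One possible workaround for the Able2Extract spacing issue described above is to make the string matching itself tolerant of whitespace, rather than requiring the extractor to be perfect. The following is only a minimal sketch: the function names are hypothetical, and it assumes that the extra spaces appear as runs of whitespace between words and directly after a run of two or more dots, which may not match the real output.

```python
import re

# Assumption: spurious spacing shows up as runs of whitespace between words.
_WS_RUN = re.compile(r"\s+")
# Assumption: the extra spaces after ".." directly follow a run of 2+ dots.
_DOTS_THEN_SPACE = re.compile(r"(\.{2,})\s+")


def normalize_cell(text: str) -> str:
    """Return a whitespace-normalised copy of one extracted table cell."""
    # Drop whitespace that directly follows a run of dots.
    text = _DOTS_THEN_SPACE.sub(r"\1", text)
    # Collapse remaining whitespace runs to one space and trim the ends.
    return _WS_RUN.sub(" ", text).strip()


def cells_match(extracted: str, expected: str) -> bool:
    """Compare two cell strings after normalising both sides the same way."""
    return normalize_cell(extracted) == normalize_cell(expected)


if __name__ == "__main__":
    print(normalize_cell("Total  net   income"))
    print(cells_match("Revenue..  1,200", "Revenue.. 1,200"))
```

Because both sides of the comparison go through the same normalisation, the parser no longer depends on the exact spacing the extractor produced. This does not help where the spacing itself carries meaning, such as column alignment.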


