Planet PDF Forum Archive





Topic: Extracting text from PDF problem
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 162943
From: Granty555
Date: 10/19/2007 12:50 AM

Hello all,

First post here, so please excuse me if I've posted in the wrong forum!

I have to extract the text from multiple (read: thousands of) PDFs into Excel so I can manipulate the data for later use. Some are scanned images, and I know I have to use OCR for those. However, the vast majority contain (as far as I know) proper text: I can open them in Adobe Reader, select some text, copy and paste it into Notepad, and it comes out fine.

My problem is that the content is table-based and very complicated (and unfortunately confidential, so I can't post an example), and I need to keep the format exactly as it is so that my parsing routines can correctly build the required data model. I have tried several text-extraction products, but each one seems to fail in some way.

ABBYY FineReader 8.0 keeps the format beautifully but does not recognise the text! It extracts some of it (e.g. table column headers, footnotes), but the data actually in the table is not recognised at all. Other text extractors seem to have no problem getting all the text but cannot reproduce the format in Excel. Able2Extract can extract the text, and the format isn't bad, but it seems to create double spacing between words and, whenever it encounters "..", add extra spaces.
The text must be perfect, as the parsing routine uses a lot of string matching, and with tens of thousands of pages, manual correction (or even checking) is not acceptable.
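
One possible workaround for the spacing problem would be to clean each extracted cell before the string matching runs. The following is only a minimal sketch, not a tested fix: the function name `normalize_cell` is hypothetical, and it assumes that the stray spaces appear directly before or after a run of two or more dots and that no cell legitimately needs doubled spaces.

```python
import re

# Any run of whitespace (spaces, tabs, line breaks) inside a cell.
_WS_RUN = re.compile(r"\s+")
# A run of two or more dots together with any whitespace touching it.
# Assumption: this is where the extractor inserts its extra spaces.
_DOT_RUN = re.compile(r"\s*(\.{2,})\s*")


def normalize_cell(text: str) -> str:
    """Collapse whitespace runs to one space and drop spaces around dot runs."""
    collapsed = _WS_RUN.sub(" ", text).strip()
    return _DOT_RUN.sub(r"\1", collapsed)


if __name__ == "__main__":
    print(normalize_cell("Total  net   assets"))
    print(normalize_cell("Item .. 42"))
```

Applying the same function to both the extracted text and the strings being matched would keep the two sides consistent, at the cost of losing any spacing that was meaningful in the original.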

I have spoken to ABBYY support, but they require a sample of the problem PDF, and I am not authorised to release it.

Does anyone have any ideas why ABBYY and Able2Extract are having problems extracting the text? Or does anyone know of a product that can reliably extract such text while mirroring the table format perfectly in the Excel output?

Many thanks

J.


