Planet PDF Forum Archive



Topic: Extracting text from PDF problem
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 162943
From: Granty555
Date: 10/19/2007 12:50 AM

Hello all,

First post here, so please excuse me if I've posted in the wrong forum!

I have to extract the text from multiple (read: thousands of) PDFs into Excel so I can manipulate the data for later use. Some are scanned images, and I know I have to use OCR for those. However, the vast majority contain (as far as I know) proper text: I can open them in Adobe Reader, select some text, copy and paste it into Notepad, and it comes out fine.

My problem is that the content is table-based and very complicated (and unfortunately confidential, so I can't post an example), and I need to keep the format exactly as it is so my parsing routines can correctly build the required data model. I have tried several text-extraction products, but each one fails in some way.

ABBYY FineReader 8.0 keeps the format beautifully, but does not recognise the text! It extracts some of it (e.g. table column headers, footnotes), but the data actually in the table is not recognised at all. Other text extractors seem to have no problem getting all the text but cannot reproduce the format in Excel. Able2Extract can extract the text, and the format isn't bad, but it seems to create double spacing between words and add extra spaces wherever it encounters "..".
The text must be perfect because the parsing routine uses a lot of string matching, and with tens of thousands of pages, manual correction (or even checking) is not acceptable.
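
For what it's worth, the Able2Extract spacing issue might be something that can be cleaned up after extraction rather than inside the tool. Below is a minimal Python sketch of that idea. It assumes the original text never has more than one space between words and never has spaces next to a run of dots; neither assumption has been checked against the real documents, and the function name `normalize_cell` is made up for illustration.

```python
import re


def normalize_cell(text: str) -> str:
    """Clean up spacing artifacts in one extracted cell (hypothetical helper).

    Assumes the source text uses single spaces between words and has no
    spaces directly before or after a run of two or more dots.
    """
    # Collapse any run of spaces or tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Drop a single space on either side of a run of two or more dots.
    text = re.sub(r" ?(\.{2,}) ?", r"\1", text)
    # Trim leading and trailing whitespace left over from extraction.
    return text.strip()


if __name__ == "__main__":
    print(normalize_cell("Total  amount .. 42"))
```

If the real data does contain intentional double spaces or spaced dots, this would damage it, so it would need testing on a known-good sample first.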

I have spoken to ABBYY support, but they require a sample of the problem PDF and I am not authorised to release it.

Does anyone have any idea why ABBYY and A2E are having problems extracting the text? Or does anyone know of a product that can reliably extract such text while mirroring the table format perfectly in the Excel output?

Many thanks

J.


