Topic: Extracting text from PDF problem
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 162943
Date: 10/19/2007 12:50 AM
First post here, so please excuse me if I've posted in the wrong forum!
I have to extract the text from multiple (read: thousands of) PDFs into Excel so I can manipulate the data for later use. Some are scanned images, and I know I have to use OCR for those. However, the vast majority contain (as far as I know) proper text: I can open them in Adobe Reader, select some text, copy and paste it into Notepad, and it comes out fine.
My problem is that the content is table-based and very complicated (and unfortunately confidential, so I can't post an example), and I need to keep the format exactly as it is so that my parsing routines can correctly build the required data model. I have tried several text-extraction products, but each one seems to fail in some way.
ABBYY FineReader 8.0 keeps the format beautifully but does not recognise the text! It extracts some of it (e.g. table column headers, footnotes), but the data actually in the table is not recognised at all. Other text extractors seem to have no problem getting all the text but cannot reproduce the format in Excel. Able2Extract can extract the text, and the format isn't bad, but it seems to create double spacing between words and add extra spaces whenever it encounters ".."
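One possible workaround for the Able2Extract spacing issue is a cleanup pass over each extracted cell before it reaches the parsing routines. The following is only a minimal Python sketch: the function name `normalise_cell` and the `tighten_dots` flag are hypothetical, and the dot handling assumes the stray spaces appear directly beside runs of two or more dots, which would need checking against the real output.

```python
import re


def normalise_cell(text: str, tighten_dots: bool = False) -> str:
    """Collapse runs of spaces/tabs into a single space and trim both ends."""
    cleaned = re.sub(r"[ \t]+", " ", text)
    if tighten_dots:
        # Assumption: the extra spaces sit immediately before or after
        # a run of two or more dots; remove them if so.
        cleaned = re.sub(r" ?(\.{2,}) ?", r"\1", cleaned)
    return cleaned.strip()


if __name__ == "__main__":
    print(normalise_cell("Total  net   assets"))
    print(normalise_cell("Item  ..  42", tighten_dots=True))
```

This would only be safe if the source tables never contain intentional double spaces; otherwise the cleanup itself would break the string matching.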
The text must be perfect because the parsing routine relies heavily on string matching, and with tens of thousands of pages, manual correction (or even checking) is not acceptable.
I have spoken to ABBYY support, but they require a sample of the problem PDF and I am not authorised to release it.
Does anyone have any idea why ABBYY and Able2Extract (A2E) are having problems extracting the text? Or does anyone know of a product that can reliably extract such text while mirroring the table format perfectly in the Excel output?