Planet PDF Forum Archive


Topic: Extracting text from PDF problem
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 162943
From: Granty555
Date: 10/19/2007 12:50 AM

Hello all,

First post here, so please excuse me if I've posted in the wrong forum!

I have to extract the text from multiple (read: thousands of) PDFs into Excel so I can manipulate the data for later use. Some are scanned images etc. and I know I have to use OCR for those. However, the vast majority contain (as far as I know) proper text. I can open them in Adobe Reader, select some text, copy and paste it into Notepad, and it comes out fine.

My problem is that the content is table-based and very complicated (and unfortunately confidential, so I can't post an example), and I need to keep the format exactly as it is so my parsing routines can correctly build up the required data model. I have tried several text-extraction products, but each one seems to fail in some way.

ABBYY FineReader 8.0 keeps the format beautifully, but does not recognise the text! It extracts some of it (e.g. table column headers, footnotes), but the data actually in the table is not recognised at all. Other text extractors seem to have no problem getting all the text but cannot reproduce the format in Excel. Able2Extract can extract the text, and the format isn't bad, but it seems to create double spacing between words and add extra spaces when it encounters "..".
The text must be perfect as the parsing routine uses a lot of string-matching, and with tens of thousands of pages, manual correction (or even checking) is not acceptable.
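
One possible workaround for the double-spacing issue, rather than a fix for the extractors themselves, is to normalise whitespace in each extracted cell before the string-matching step. The following is only a minimal Python sketch: the function names are hypothetical, it assumes the rows are already available as lists of cell strings, and it assumes that runs of spaces never carry meaning in the data. It does not attempt to handle the ".." case, since exactly where the extra spaces land is not known.

```python
import re

# Matches any run of one or more spaces or tabs.
_WHITESPACE_RUN = re.compile(r"[ \t]+")


def normalize_cell(text: str) -> str:
    """Collapse runs of spaces/tabs into a single space and trim both ends."""
    return _WHITESPACE_RUN.sub(" ", text).strip()


def normalize_row(cells: list[str]) -> list[str]:
    """Apply normalize_cell to every cell of one extracted table row."""
    return [normalize_cell(cell) for cell in cells]


if __name__ == "__main__":
    # Hypothetical row with the kind of double spacing described above.
    row = ["Total  net   assets ", "1,234", "  n/a"]
    print(normalize_row(row))
```

If the same normalisation is applied to the strings the parser matches against, a stray double space no longer breaks an exact comparison, though it cannot recover text that was never extracted.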

I have spoken to ABBYY support but they require a sample of the problem pdf and I am not authorised to release it.

Does anyone have any ideas why ABBYY and A2E are having problems extracting the text? Or does anyone know a product that can reliably extract such text while mirroring the table-format perfectly in the Excel output?

Many thanks

J.


