Planet PDF Forum Archive


Topic: problem extracting text from PDF
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 160067
From: o_hodjati
Date: 5/10/2007 04:52 PM

Hi All,
My application is supposed to extract text from PDF files, and I use Acrobat 8.0 to do this. Acrobat uses an internal word-breaking algorithm and lets me extract the text word by word. As long as the PDF files contain only Latin characters, everything works fine. But with non-Latin characters, my application runs into some problems:
a- Some characters are incorrectly interpreted as word-breaking characters, so I get two "half-words" instead of the complete word. Unfortunately, this happens with the character that Microsoft Word inserts automatically in right-to-left paragraphs when the paragraph is "justified". So text cannot be correctly extracted from any right-to-left (Arabic) document that has justified paragraphs!
b- Some other Unicode characters (including Arabic and Farsi digits) are not recognized by Acrobat and are replaced by a dot character.
c- Acrobat processes text only left-to-right and does not take right-to-left order into account when processing mixed lines of text.
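
To make problem (c) concrete, here is a minimal sketch of how a line of already-extracted text could be flagged as mixed-direction. It uses only Python's standard `unicodedata` module and has nothing to do with Acrobat's own API; the function name `direction_profile` is hypothetical, and the check looks only at strong bidirectional classes, so it is a rough detector rather than an implementation of the full Unicode bidirectional algorithm.

```python
import unicodedata

# Strong right-to-left classes: "R" (e.g. Hebrew) and "AL" (Arabic letters).
RTL_CLASSES = {"R", "AL"}


def direction_profile(line):
    """Classify a line as 'ltr', 'rtl', 'mixed', or 'neutral'.

    Only strongly directional characters are considered; digits,
    spaces, and punctuation do not influence the result.
    """
    classes = {unicodedata.bidirectional(ch) for ch in line}
    has_rtl = bool(classes & RTL_CLASSES)
    has_ltr = "L" in classes
    if has_rtl and has_ltr:
        return "mixed"
    if has_rtl:
        return "rtl"
    if has_ltr:
        return "ltr"
    return "neutral"
```

Lines reported as "mixed" are the ones where a purely left-to-right extraction order is most likely to come out wrong.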

I have the following questions:
1- Is there any way to bypass Acrobat's word-breaking algorithm, or at least to partially change how it interprets some specific characters?
2- Is there any way to make Acrobat replace some characters with other ones?
3- I know that there is an ME (Middle Eastern) version of the Adobe Acrobat application. Does this version handle mixed (right-to-left and left-to-right) text better than the non-ME version?
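
For questions 1 and 2, one possible workaround outside Acrobat is to post-process raw text instead of relying on the built-in word finder. The sketch below rests on two assumptions that are not confirmed above: that the character Word inserts for justification is the Arabic tatweel (kashida, U+0640), and that the raw text can be obtained with its characters intact, for example from a different extractor. The function names `normalize` and `split_words` are hypothetical.

```python
import unicodedata

# Assumption: the justification filler is ARABIC TATWEEL (U+0640).
TATWEEL = "\u0640"


def normalize(text):
    """Drop tatweel and map any Unicode decimal digit to ASCII 0-9."""
    out = []
    for ch in text:
        if ch == TATWEEL:
            continue  # the filler carries no meaning, so remove it
        if unicodedata.category(ch) == "Nd":
            # Covers Arabic-Indic (U+0660..) and Farsi (U+06F0..) digits.
            out.append(str(unicodedata.digit(ch)))
        else:
            out.append(ch)
    return "".join(out)


def split_words(text):
    """Split normalized text on whitespace only.

    Punctuation stays attached to the neighbouring word; tatweel is
    never treated as a word boundary.
    """
    return normalize(text).split()
```

With this approach a word stretched with tatweel comes back as one word, and Arabic or Farsi digits come back as ASCII digits rather than dots.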

Thanks in advance for your help.
