Planet PDF Forum Archive



Topic: problem extracting text from PDF
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 160067
From: o_hodjati
Date: 5/10/2007 04:52 PM

Hi All,
My application is supposed to extract text from PDF files, and I have used Acrobat 8.0 to do this. Acrobat uses an internal word-breaking algorithm and lets me extract the text word by word. As long as the PDF files contain only Latin characters, everything works fine. But with non-Latin characters, my application runs into some problems:
a- Some characters are incorrectly interpreted as word-breaking characters, so I get two "half-words" instead of the complete word. Unfortunately, this happens with the character that Microsoft Word inserts automatically in right-to-left paragraphs when the paragraph is "justified". So text cannot be correctly extracted from any right-to-left (Arabic) document that has justified paragraphs!
b- Some other Unicode characters (including Arabic and Farsi digits) are not recognized by Acrobat and are replaced by a dot character.
c- Acrobat processes text only left to right and does not take right-to-left ordering into account when processing mixed lines of text.

I have the following questions:
1- Is there any way to bypass Acrobat's word-breaking algorithm, or at least to partially change how it interprets some specific characters?
2- Is there any way to make Acrobat replace some characters with other ones?
3- I know that there is an ME (Middle Eastern) version of the Adobe Acrobat application. Does that version handle mixed (right-to-left and left-to-right) text better than the non-ME version?

I appreciate your help in advance.
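
For reference, here is a minimal sketch in Python of the kind of character replacement asked about in question 2, done as a post-processing step outside Acrobat. It rests on assumptions not confirmed anywhere in this thread: that the justification character from problem (a) is the Arabic tatweel (kashida, U+0640), and that the text comes from an extractor that keeps the original characters. It cannot recover digits that have already been replaced by dots, as in problem (b). The function name `normalize_extracted` is hypothetical.

```python
import unicodedata

# ASSUMPTION: the character Word inserts when justifying right-to-left
# paragraphs is the Arabic tatweel (kashida), U+0640.
TATWEEL = "\u0640"


def normalize_extracted(text: str) -> str:
    """Drop tatweel characters and map any Unicode decimal digit to ASCII."""
    out = []
    for ch in text:
        if ch == TATWEEL:
            # The tatweel only stretches a word visually; removing it
            # leaves the word's letters unchanged.
            continue
        if unicodedata.category(ch) == "Nd":
            # Covers Arabic-Indic (U+0660..U+0669) and Farsi/Extended
            # Arabic-Indic (U+06F0..U+06F9) digits, among others.
            out.append(str(unicodedata.decimal(ch)))
        else:
            out.append(ch)
    return "".join(out)
```

For example, `normalize_extracted("\u0633\u0640\u0640\u0644\u0627\u0645 \u06f1\u06f2\u06f3")` returns `"\u0633\u0644\u0627\u0645 123"`: the two tatweels are dropped and the three Farsi digits become ASCII digits. This does nothing about the right-to-left ordering in problem (c).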

