Topic: problem extracting text from PDF
Conf: (P-PDF) What's Wrong with my PDF?, Msg: 160067
Date: 5/10/2007 04:52 PM
My application is supposed to extract text from PDF files, and I have used Acrobat 8.0 to do this. Acrobat uses an internal word-breaking algorithm and lets me extract the text word by word. As long as the PDF files contain only Latin characters, everything works fine. With non-Latin characters, however, my application runs into the following problems:
a- Some characters are incorrectly interpreted as word-breaking characters, so I get two "half-words" instead of the complete word. Unfortunately, this happens with the character that Microsoft Word inserts automatically in right-to-left paragraphs when the paragraph is justified. As a result, text cannot be extracted correctly from any right-to-left (Arabic) document that has justified paragraphs!
b- Some other Unicode characters (including Arabic and Farsi digits) are not recognized by Acrobat and are replaced by a dot character.
c- Acrobat processes text only left-to-right and does not take right-to-left ordering into account when processing lines of mixed text.
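To make the three problems concrete, here is a minimal Python sketch of the behavior the extraction would need. It works on plain Unicode strings and does not use any Acrobat API. It rests on assumptions that are not confirmed above: that the character Word inserts for justification is the Arabic tatweel/kashida (U+0640), and that the raw character stream is available before word breaking. The function names are hypothetical.

```python
import unicodedata

# Assumption: the justification character inserted by Word is the
# Arabic tatweel (kashida), U+0640. The post does not name it.
TATWEEL = "\u0640"


def split_words(text):
    """Problem (a): drop tatweel before splitting on whitespace,
    so a stretched word is not cut into two half-words."""
    return text.replace(TATWEEL, "").split()


def to_ascii_digits(text):
    """Problem (b): map any Unicode decimal digit (for example
    Arabic-Indic U+0660..U+0669 or Persian U+06F0..U+06F9)
    to its ASCII equivalent; other characters pass through."""
    out = []
    for ch in text:
        value = unicodedata.decimal(ch, None)
        out.append(str(value) if value is not None else ch)
    return "".join(out)


def has_mixed_direction(line):
    """Problem (c): report whether a line contains both strong
    left-to-right (L) and strong right-to-left (R, AL) characters.
    This only flags such lines; it does not reorder them."""
    classes = {unicodedata.bidirectional(ch) for ch in line}
    return "L" in classes and bool(classes & {"R", "AL"})
```

A line flagged by `has_mixed_direction` would still need a proper implementation of the Unicode Bidirectional Algorithm to be put into logical order; that step is outside this sketch.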
I have the following questions:
1- Is there any way to bypass Acrobat's word-breaking algorithm, or at least to partially change how it interprets some specific characters?
2- Is there any way to make Acrobat replace some characters with others?
3- I know that there is an ME (Middle Eastern) version of Adobe Acrobat. Does that version handle mixed (right-to-left and left-to-right) text better than the non-ME version?
Thank you in advance for your help.