Topic: Font encodings
Conf: (P-PDF) Developers, Msg: 53377
Date: 5/29/2002 04:58 PM
This is probably a FAQ; sorry if it is. I am trying to extract text from PDF files. When the text is set in what seems to be an embedded font, some of the extracted characters are non-standard. In some files the characters appear to be random; in others only a few characters (like double backquotes) come out non-standard. Is this because (a) some fonts use non-standard encodings, or (b) Acrobat Distiller applies its own encoding to text in embedded fonts? If it's (b), can the original character encodings be deduced from the information in the PDF file? Thanks for any help. BTW, I'm using Java and the PJ Etymon library to do the PDF parsing.
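To make question (b) concrete: as far as I understand the PDF format, a font dictionary can carry an /Encoding entry with a /Differences array, which overrides individual character codes with glyph names (an integer sets the current code, and each name that follows is assigned to consecutive codes). Below is a minimal sketch in Java of how such an array might be applied. It assumes the array has already been read out of the font dictionary as Integers and Strings; the class name, method names, and the tiny glyph-name table are hypothetical illustrations and are not part of PJ. It also assumes that codes without an override agree with Latin-1, which will not hold for every font.

```java
import java.util.HashMap;
import java.util.Map;

class DifferencesSketch {
    // Tiny illustrative subset of glyph-name-to-Unicode mappings.
    // A real extractor would need a much larger table (assumption).
    static final Map<String, String> GLYPH_TO_UNICODE = new HashMap<>();
    static {
        GLYPH_TO_UNICODE.put("quotedblleft", "\u201C");
        GLYPH_TO_UNICODE.put("quotedblright", "\u201D");
        GLYPH_TO_UNICODE.put("fi", "\uFB01");
    }

    // Turn a Differences-style array into a code -> glyph name map.
    // An Integer resets the current code; each String is assigned to
    // the current code, which then advances by one.
    static Map<Integer, String> parseDifferences(Object[] differences) {
        Map<Integer, String> overrides = new HashMap<>();
        int code = 0;
        for (Object item : differences) {
            if (item instanceof Integer) {
                code = (Integer) item;
            } else {
                overrides.put(code, (String) item);
                code++;
            }
        }
        return overrides;
    }

    // Decode character codes: overridden codes go through the glyph
    // table, all others are assumed (hypothetically) to be Latin-1.
    static String decode(int[] codes, Map<Integer, String> overrides) {
        StringBuilder sb = new StringBuilder();
        for (int c : codes) {
            String name = overrides.get(c);
            if (name == null) {
                sb.append((char) c);
            } else {
                // Unknown glyph names are marked with '?'.
                sb.append(GLYPH_TO_UNICODE.getOrDefault(name, "?"));
            }
        }
        return sb.toString();
    }
}
```

With a Differences array such as `{170, "quotedblleft", "quotedblright"}`, codes 170 and 171 would decode to curly double quotes instead of whatever those codes mean in the base encoding, which may be the kind of thing behind my "double backquotes". This would not help in the cases where the characters look completely random.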