Topic: extract text and check embedded fonts?
Conf: (P-PDF) Developers, Msg: 69603
Date: 7/31/2002 07:30 AM
I'm trying to replace a PDF workflow that is several years old. There are two things I need to do:
1. Roughly extract the text from the PDF so that it can be fed to a search engine.
2. Examine the file to verify that all fonts (other than the base 14) are embedded (subsetting is OK, but substitution is not).
Can someone recommend a cheap/easy way of doing this?
I've seen various attempts at (1) already discussed, and it appears that compression and various word placement artifacts make this nontrivial to program directly (shame!). The existing code uses an old version of the Adobe API, but we'd rather not use it going forward because of Adobe's server licensing stance.
Existing libs seem to be the way to go, but given that what I'm doing is somewhat trivial (IMO) and most libs seem to be focused on generating PDFs rather than reading them, I'm having problems finding a free/cheap/shareware solution.
I'm also hoping that (2) can be done easily by inspection? Before I tackle the PDF spec, I'd like to know if it's a wild goose chase...
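To make (2) concrete, here is a rough sketch of what I mean by "inspection". In the PDF spec, an embedded font program is referenced from the font descriptor through a /FontFile, /FontFile2 or /FontFile3 entry, so the idea is to flag every descriptor that has none of these and whose name is not one of the base 14. The function name and the regex scan are my own invention, not any library's API, and the sketch assumes that objects sit uncompressed in the file body and that font names contain no # escapes, so treat it as a heuristic rather than a real parser.

```python
import re

# The 14 standard fonts that a viewer is expected to supply itself.
BASE14 = {
    "Courier", "Courier-Bold", "Courier-Oblique", "Courier-BoldOblique",
    "Helvetica", "Helvetica-Bold", "Helvetica-Oblique",
    "Helvetica-BoldOblique",
    "Times-Roman", "Times-Bold", "Times-Italic", "Times-BoldItalic",
    "Symbol", "ZapfDingbats",
}

# Heuristic patterns over the raw bytes (assumption: objects are not
# stored inside compressed object streams).
OBJ_RE = re.compile(rb"\d+\s+\d+\s+obj(.*?)endobj", re.S)
DESCRIPTOR_RE = re.compile(rb"/Type\s*/FontDescriptor\b")
NAME_RE = re.compile(rb"/FontName\s*/([^\s/<>\[\]()]+)")
FONTFILE_RE = re.compile(rb"/FontFile[23]?\b")


def unembedded_fonts(pdf_bytes):
    """Return sorted names of non-base-14 fonts whose descriptor
    references no embedded font program (hypothetical helper)."""
    missing = set()
    for match in OBJ_RE.finditer(pdf_bytes):
        body = match.group(1)
        if not DESCRIPTOR_RE.search(body):
            continue  # not a font descriptor object
        if FONTFILE_RE.search(body):
            continue  # a font program is embedded (full or subset)
        name_match = NAME_RE.search(body)
        if name_match is None:
            continue  # descriptor without a readable name; skip it
        name = name_match.group(1).decode("latin-1")
        if name not in BASE14:
            missing.add(name)
    return sorted(missing)
```

Usage would be something like unembedded_fonts(open(path, "rb").read()), where path is whatever file is being checked; an empty list means the scan found no descriptor lacking a font program. Subset fonts pass because their descriptors still carry a /FontFile entry.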