Topic: Re: Indexing PDFs
Conf: (P-PDF) Developers, Msg: 57499
Date: 5/29/2002 05:26 PM
> I believe that the PDFToolkit from ActivePDF
> supports text extraction, and it's a
> COM/ActiveX component.
Alas, it doesn't; it's one of the ones I checked.
> If that doesn't work for you, SPDF is a "high end"
> C/C++ PDF library from Digital Applications
> that WILL do this.
Thanks Leonard, we'll certainly take a look.
> Other options include running command line applications like
> pdftotext (part of Xpdf)
Looks interesting, especially the LZW bypass he employs.
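
For anyone wanting to try the command-line route, here is a minimal sketch of driving Xpdf's pdftotext from a script. It assumes pdftotext is installed and on the PATH; the helper names build_pdftotext_command and extract_text are made up for illustration, and the flags used (-enc, -f, -l, and "-" as the output file for stdout) are the standard pdftotext ones.

```python
import shutil
import subprocess


def build_pdftotext_command(pdf_path, first_page=None, last_page=None):
    """Build the argument list for pdftotext, writing the text to stdout."""
    cmd = ["pdftotext", "-enc", "UTF-8"]
    if first_page is not None:
        cmd += ["-f", str(first_page)]
    if last_page is not None:
        cmd += ["-l", str(last_page)]
    # Passing "-" as the text-file argument sends the output to stdout.
    cmd += [pdf_path, "-"]
    return cmd


def extract_text(pdf_path):
    """Run pdftotext and return the text, or None if the tool is not installed."""
    if shutil.which("pdftotext") is None:
        return None
    result = subprocess.run(
        build_pdftotext_command(pdf_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if pdftotext fails on the file
    )
    return result.stdout.decode("utf-8", errors="replace")
```

The returned string can then be handed to whatever indexer is in use; spawning one process per file is the main cost of this approach.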
> Ghostscript.
We once researched this; it seemed way too clunky.
> And I believe there is a Perl module (Text::PDF) that will do
> Well, even to just get "words", you still have to do work
> because some PDF creation tools will break up words into multiple
> draw operations in order to achieve things like kerning.
Indeed. What I meant was, we don't need the rest! (Well, in
practice, getting document information like title, subject and keywords
is also useful, but that's about it.)
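
To illustrate the point about words being split across draw operations, here is a rough sketch of stitching fragments back together. It assumes an extractor has already produced, for one line of text, a list of (x_start, x_end, text) tuples; that tuple layout, the join_fragments name and the gap_threshold value are all assumptions for illustration, not any particular library's output.

```python
def join_fragments(fragments, gap_threshold=1.0):
    """Merge text fragments from one line into words.

    Each fragment is (x_start, x_end, text). Fragments separated by a
    horizontal gap smaller than gap_threshold (including the negative gaps
    that tight kerning produces) are treated as parts of the same word.
    """
    words = []
    current = ""
    prev_end = None
    for x_start, x_end, text in sorted(fragments):
        # A gap at or above the threshold is taken to be a word space.
        if prev_end is not None and x_start - prev_end >= gap_threshold:
            if current:
                words.append(current)
            current = ""
        current += text
        prev_end = x_end
    if current:
        words.append(current)
    return words


# "Indexing" arrives as three kerned pieces, followed by a real word space.
pieces = [(0.0, 5.0, "In"), (5.2, 12.0, "dex"), (12.1, 18.0, "ing"), (21.0, 30.0, "PDFs")]
print(join_fragments(pieces))
```

A fixed threshold is the crudest possible choice; in practice it would probably need scaling by font size, since a word space in 24pt text is wider than one in 8pt text.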
Thanks again for the useful "points".
Peter Hyde, WebCentre Ltd & SPIS Ltd, Christchurch, New Zealand
* Web automation for online periodicals: http://TurboPress.com
* TurboNote+: http://TurboPress.com/tbnote.htm
-- easy, small, handy onscreen sticky notes