Topic: Want to extract text wrapped by ms-word style tag in PDF
Conf: (P-PDF) Developers, Msg: 96758
Date: 9/20/2003 06:16 AM
I am new to this forum, and in fact new to PDF as well.
I need to parse PDF files and extract certain pieces of text from them. These PDF files were originally converted from MS Word documents by Acrobat PDFMaker. The text I want is marked with an MS Word style such as Heading 1.
There seem to be two ways to do this. One is to convert the Word files to tagged PDF with Acrobat PDFMaker, so that the Word styles are converted to tags in the PDF. My program (Java on Linux) would then need to extract the content wrapped by those tags. However, I couldn't find any PDF library that supports text extraction from tagged PDF. I have checked iText, JPedal, and PDFBox, and none of them seems to support it right now. Does anybody have an idea?
The other approach is to write a plug-in for Microsoft Word (on Windows) that extracts the text (e.g. abstracts marked with the abstract_contents style) from the Word document and either puts it into an XML file or writes it directly to the metadata stream in the PDF file. After the Word file is converted to PDF, I could then extract the metadata from the PDF. For this approach, I don't know whether there is any library or plug-in available for extracting text from an MS Word file. Does anybody know?
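To make the XML part of the second approach concrete, here is a rough sketch of what I have in mind, using only the standard JDK XML classes. The class name StyledTextXml, the method toXml, and the element names "extracted" and "text" are all made up for illustration. It assumes the text has already been pulled out of Word somehow and is keyed by style name; that extraction step is exactly the part I don't know how to do.

```java
import java.io.StringWriter;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: serializes text keyed by Word style name to XML.
public class StyledTextXml {

    // Builds one <text style="..."> element per map entry under a single root.
    public static String toXml(Map<String, String> textByStyle) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();
        Element root = doc.createElement("extracted"); // assumed element name
        doc.appendChild(root);

        for (Map.Entry<String, String> entry : textByStyle.entrySet()) {
            Element el = doc.createElement("text");    // assumed element name
            el.setAttribute("style", entry.getKey());
            el.setTextContent(entry.getValue());       // special characters are escaped on output
            root.appendChild(el);
        }

        // Serialize the DOM tree to a string without the XML declaration.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        // Placeholder data standing in for text extracted from a Word file.
        Map<String, String> textByStyle = new LinkedHashMap<>();
        textByStyle.put("abstract_contents", "Sample abstract text.");
        System.out.println(toXml(textByStyle));
    }
}
```

The same string could presumably be written to a file next to the PDF, or embedded in the PDF's metadata stream if a library allows that.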
Appreciate your help!