Planet PDF Forum Archive


Topic: Want to extract text wrapped by ms-word style tag in PDF
Conf: (P-PDF) Developers, Msg: 96758
From: elmer
Date: 9/20/2003 06:16 AM

Hi everybody,

I am new to this forum. Actually I am new to PDF.

I need to parse PDF files and extract certain text from them. These PDF files were originally converted from MS Word by Acrobat PDFMaker. The text I want is marked with an MS Word style such as Heading 1.

It seems there are two ways to do this. One way is to convert the Word files to tagged PDF files with Acrobat PDFMaker, so that the Word styles are converted to tags in the PDF. My program (Java on Linux) would then need to extract the content wrapped by those tags. But I couldn't find any PDF library that supports text extraction from tagged PDF. I have checked iText, JPedal, and PDFBox, and it seems they don't support it right now. Does anybody have an idea?
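To make the first approach concrete: the extraction step amounts to walking the PDF's structure tree and collecting the text under elements of a given type. Below is a rough sketch in Java. It uses a hypothetical in-memory `StructNode` class, not the API of any real PDF library, and it assumes the Word style Heading 1 ends up as an `H1` structure element, which may not hold for every conversion setup.

```java
import java.util.ArrayList;
import java.util.List;

public class TaggedTextSketch {

    // Hypothetical stand-in for one structure element of a tagged PDF.
    static class StructNode {
        final String type;  // structure type, e.g. "H1" or "P" (assumed mapping)
        final String text;  // text held directly by this element, may be empty
        final List<StructNode> kids = new ArrayList<>();

        StructNode(String type, String text) {
            this.type = type;
            this.text = text;
        }

        StructNode add(StructNode kid) {
            kids.add(kid);
            return this;
        }
    }

    // Depth-first walk: collect the full text of every element of the wanted type.
    static List<String> collect(StructNode node, String wantedType) {
        List<String> out = new ArrayList<>();
        if (node.type.equals(wantedType)) {
            out.add(textOf(node));
        } else {
            for (StructNode kid : node.kids) {
                out.addAll(collect(kid, wantedType));
            }
        }
        return out;
    }

    // Concatenate an element's own text with the text of all its descendants.
    static String textOf(StructNode node) {
        StringBuilder sb = new StringBuilder(node.text);
        for (StructNode kid : node.kids) {
            sb.append(textOf(kid));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Made-up sample tree, only to exercise the walk.
        StructNode root = new StructNode("Document", "")
            .add(new StructNode("H1", "Introduction"))
            .add(new StructNode("P", "Body text."))
            .add(new StructNode("H1", "").add(new StructNode("Span", "Results")));
        System.out.println(collect(root, "H1"));
    }
}
```

The hard part, reading the structure tree and the marked content out of a real file, is exactly what is missing here, so this only shows the shape of the traversal.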

The other approach is to write a plug-in for Microsoft Word (on Windows) that extracts the text (e.g. abstracts with the abstract_contents style) from the Word file and writes it to an XML file, or directly to the metadata stream in the PDF file. After the Word file is converted to PDF, I could extract the metadata from the PDF. For this approach, I don't know if there is any library or plug-in available for extracting text from MS Word files. Does anybody know?
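For the second approach, the XML step on its own is simple once the paragraphs and their style names are in hand. A small sketch in Java, assuming the paragraphs have already been read out of the Word file by some other means (that part is not shown); the element names `extracted` and `text` are made up for illustration.

```java
public class StyleXmlSketch {

    // Escape the characters that are special in XML text and attribute values.
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    // paragraphs: each entry is {styleName, text}. Only the wanted style is kept.
    static String toXml(String[][] paragraphs, String wantedStyle) {
        StringBuilder sb = new StringBuilder("<extracted>\n");
        for (String[] p : paragraphs) {
            if (p[0].equals(wantedStyle)) {
                sb.append("  <text style=\"").append(escape(p[0])).append("\">")
                  .append(escape(p[1])).append("</text>\n");
            }
        }
        return sb.append("</extracted>\n").toString();
    }

    public static void main(String[] args) {
        // Made-up sample paragraphs; real ones would come from the Word file.
        String[][] paragraphs = {
            {"Heading 1", "A Sample Title"},
            {"abstract_contents", "This paper studies X & Y."},
            {"Normal", "Body text."}
        };
        System.out.print(toXml(paragraphs, "abstract_contents"));
    }
}
```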

Appreciate your help!

