Planet PDF Forum Archive





Topic: extract plain text from pdf
Conf: (P-PDF) Developers, Msg: 113843
From: menakaindrani
Date: 6/22/2004 07:58 AM

Hi,

Extracting text from a PDF is pretty simple, and there is no need to use iText for it. Here is what you can do:

As you know, all the text in a PDF file lives inside content streams, between the BT (begin text) and ET (end text) operators:

stream

BT.....ET

endstream

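Inside BT ... ET, Tj means "show text" and TJ means "show text with individual glyph positioning": TJ takes an array of strings interleaved with numeric position adjustments.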

Also, as you know, the streams in a PDF file are usually compressed (most often with the FlateDecode filter), so your first step is to decompress them, either with zlib's inflate function or with a method in iText.

So the first step is to decompress the stream into a buffer.
Then read the buffer: inside the stream look for BT ... ET, and inside BT ... ET look for Tj or TJ and capture your text. Easy.
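
The decompression step could be sketched in Python roughly like this. It is only an illustration under some assumptions: the function name decoded_streams is made up here, the streams are located by scanning for the stream/endstream keywords rather than by parsing the objects, and only Flate (zlib) compression is handled. A real reader should honour each stream's /Length and /Filter entries.

```python
import re
import zlib

# Raw bytes between the "stream" and "endstream" keywords.
# The lookbehind stops the tail of "endstream" from being taken as "stream".
STREAM_RE = re.compile(rb"(?<!end)stream\r?\n(.*?)endstream", re.DOTALL)

def decoded_streams(pdf_bytes):
    """Return every stream body, inflated when it is zlib (Flate) data."""
    result = []
    for match in STREAM_RE.finditer(pdf_bytes):
        raw = match.group(1)
        try:
            # decompressobj tolerates the end-of-line bytes that sit
            # between the compressed data and "endstream".
            result.append(zlib.decompressobj().decompress(raw))
        except zlib.error:
            # Not Flate data (uncompressed, or another filter): keep as-is.
            result.append(raw.rstrip(b"\r\n"))
    return result
```

Each returned buffer can then be searched for BT ... ET as described above.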

EXAMPLE 1 FOR Tj
In the following stream the text is "Small test".
The decoded stream looks like:

q Q q 417.783 228.232 344.126 520.191 re W n /Perceptual ri q 346.126
0 0 523.214 416.783 226.72 cm /Im1 Do Q Q q 243.052 155.613 693.588
62.2451 re W n /Gs1 gs 1 sc q 1 0 0 -1 243.052 217.858 cm 0 0 693.588
62.2451 re f Q Q q 243.052 155.613 693.588 62.2451 re W n q 1 0 0 -1
243.052 217.858 cm BT 41.4967 0 0 -41.4967 265 49 Tm /F1.0 1 Tf (Small test)
Tj ET Q Q q 243.052 108.188 693.588 41.4967 re W n /Gs1 gs 1 sc q 1
0 0 -1 243.052 149.685 cm 0 0 693.588 41.4967 re f Q Q
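
As a rough sketch, the Tj case could be handled in Python with two regular expressions. The name show_text_strings is hypothetical, and decoding the bytes as Latin-1 is an assumption: the real mapping from bytes to characters depends on the font's encoding. The sketch would also be fooled by a string that itself contains "BT" or "ET".

```python
import re

# A text object: everything between the BT and ET operators.
TEXT_OBJECT_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)
# A literal string "( ... )" followed by the Tj operator.
# Backslash escapes such as \( and \) are allowed inside the string.
TJ_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj\b")

def show_text_strings(content):
    """Collect the operand of every Tj found inside BT ... ET."""
    texts = []
    for block in TEXT_OBJECT_RE.finditer(content):
        for string in TJ_RE.finditer(block.group(1)):
            # Assumption: one byte per character, Latin-1 compatible.
            texts.append(string.group(1).decode("latin-1"))
    return texts

content = (b"q 1 0 0 -1 243.052 217.858 cm BT 41.4967 0 0 -41.4967 265 49 Tm "
           b"/F1.0 1 Tf (Small test)\nTj ET Q Q")
print(show_text_strings(content))
```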

EXAMPLE 2 FOR TJ
In the following stream the text is "In this section, a simple component object invocation".

[(In)8( t)-9(h)8(is)-7( sectio)-16(n)8(,)-1( a )-12(si)-9(m)21(p)-4(le)-12( co)-16(m)9(pon)8(en)8(t obj)-9(ect inv)8(o)-4(cation )]TJ

So by now you must have got the hint: you have to read everything inside the parentheses ( ). The numbers between the strings are spacing adjustments, not text.
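
For the TJ case, a minimal Python sketch might join the strings of each array and drop the numbers. The function name tj_array_text is made up for illustration, Latin-1 decoding is again an assumption about the font encoding, and the sketch ignores the fact that a large adjustment can stand in for a word space.

```python
import re

# An array "[ ... ]" followed by the TJ operator.
TJ_ARRAY_RE = re.compile(rb"\[(.*?)\]\s*TJ\b", re.DOTALL)
# A literal string "( ... )", allowing backslash escapes inside it.
STRING_RE = re.compile(rb"\(((?:\\.|[^\\()])*)\)")

def tj_array_text(content):
    """Return one joined string per TJ array; the numbers are dropped."""
    lines = []
    for array in TJ_ARRAY_RE.finditer(content):
        parts = STRING_RE.findall(array.group(1))
        # Assumption: one byte per character, Latin-1 compatible.
        lines.append(b"".join(parts).decode("latin-1"))
    return lines

sample = (b"[(In)8( t)-9(h)8(is)-7( sectio)-16(n)8(,)-1( a )-12(si)-9(m)21(p)"
          b"-4(le)-12( co)-16(m)9(pon)8(en)8(t obj)-9(ect inv)8(o)-4(cation )]TJ")
print(tj_array_text(sample))
# prints ['In this section, a simple component object invocation ']
```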

BT ... ET pairs are never nested, but one can continue across streams when a page's content is split over several of them. So be careful not to stop when you encounter endstream: read up to endstream, look for the next stream, and continue reading until you find ET.

For example:
stream

BT

ET

BT
endstream

stream

ET

endstream
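
One way to cope with this, sketched in Python, is to join the decoded streams before looking for BT ... ET. The name text_objects is hypothetical, and the sketch assumes the buffers are already decompressed and passed in the order in which the page uses them.

```python
import re

# A text object: everything between the BT and ET operators.
TEXT_OBJECT_RE = re.compile(rb"\bBT\b(.*?)\bET\b", re.DOTALL)

def text_objects(decoded_streams):
    """Find BT ... ET bodies, including ones that straddle two streams."""
    # Join with a newline so the last token of one stream cannot
    # run into the first token of the next.
    joined = b"\n".join(decoded_streams)
    return [m.group(1).strip() for m in TEXT_OBJECT_RE.finditer(joined)]

streams = [b"BT (one) Tj ET\nBT /F1 12 Tf", b"(two) Tj ET"]
print(text_objects(streams))
# prints [b'(one) Tj', b'/F1 12 Tf\n(two) Tj']
```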

Menaka Indrani
