Planet PDF Forum Archive

Topic: Extracting text from PDF using Acrobat object
Conf: (P-PDF) Developers, Msg: 53232
From: kohr-ah
Date: 5/29/2002 04:57 PM

I'm trying to extract all the text from a PDF document so that I can process it. What is the best way of doing it? The way I tried was to create an instance of the Acrobat application object, create an AVDoc and open the document, get its PDDoc, and call the PDDoc's CreateTextSelect method with a rectangle, which returns a PDTextSelect object. I then loop through the text using that object's GetNumText and GetText methods. However, a single rectangle does not return the entire contents of the page. I've experimented with two rectangles that each capture half a page and concatenating the results to form a full page of text, but that doesn't work right either. Any suggestions? Here is the code I have tried (it is in VB):

Dim AcroApp As CAcroApp
Dim PDDoc As CAcroPDDoc
Dim AVDoc As CAcroAVDoc
Dim Rect As CAcroRect
Dim Rect2 As CAcroRect
Dim objText As CAcroPDTextSelect
Dim strText As String
Dim ii As Long, hh As Long

Open "c:\output.txt" For Output As #1
'Set AcroApp = CreateObject("AcroExch.App")
Set AcroApp = GetObject("", "AcroExch.App")
Set Rect = CreateObject("AcroExch.Rect")
Set Rect2 = CreateObject("AcroExch.Rect")

'A Full Screen
'Rect.Top = 725 'Bigger seems to be better
'Rect.Left = 50 'Lower seems to be better
'Rect.Right = 440 'Bigger seems to be better
'Rect.bottom = 50 'Lower seems to be better

'Two Half Screens
'Rect2.Top = 362
'Rect2.Left = 50
'Rect2.Right = 500
'Rect2.bottom = 50
'Rect.Top = 725
'Rect.Left = 50
'Rect.Right = 500
'Rect.bottom = 363

'Experimental
Rect2.Top = 377
Rect2.Left = 50
Rect2.Right = 500
Rect2.bottom = 50

Rect.Top = 725
Rect.Left = 50
Rect.Right = 500
Rect.bottom = 380

strText = ""
Set AVDoc = AcroApp.GetActiveDoc
AVDoc.Open "C:\test.PDF", "C:\test-wip.PDF"
Set PDDoc = AVDoc.GetPDDoc
For ii = 0 To PDDoc.GetNumPages - 1
    'Upper part of the page
    Set objText = PDDoc.CreateTextSelect(ii, Rect)
    'strText = strText & objText.GetText(1)
    For hh = 1 To objText.GetNumText
        'MsgBox objText.GetNumText
        'If objText.GetNumText > 0 Then
        strText = strText & objText.GetText(hh - 1)
        'End If
    Next
    'Lower part of the page
    Set objText = PDDoc.CreateTextSelect(ii, Rect2)
    For hh = 1 To objText.GetNumText
        'MsgBox objText.GetNumText
        'If objText.GetNumText > 0 Then
        strText = strText & objText.GetText(hh - 1)
        'End If
    Next
    'Write one page of text per iteration, then start over
    Print #1, CleanText(strText)
    strText = ""
Next
Close #1
Set PDDoc = Nothing
AVDoc.Close (True)
Set AVDoc = Nothing
AcroApp.Exit
Set AcroApp = Nothing
Set Rect = Nothing
Set Rect2 = Nothing
Exit Sub
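
One direction that might be worth trying, instead of hard-coded coordinates, is to size the rectangle from each page. The sketch below is untested and rests on assumptions: FullPageText is a hypothetical helper name, it assumes the full Acrobat product is installed, and it assumes the width and height reported by the page's GetSize method are in the same units and origin that CreateTextSelect expects. Pages that are rotated or whose crop box does not start at 0,0 may need different coordinates.

```vb
' Hypothetical helper: returns the text of one page using a rectangle
' built from the page's own size rather than fixed coordinates.
Function FullPageText(PDDoc As Object, ByVal pageNum As Long) As String
    Dim pdPage As Object    ' AcroExch.PDPage
    Dim pageSize As Object  ' AcroExch.Point: x = width, y = height
    Dim pageRect As Object  ' AcroExch.Rect
    Dim sel As Object       ' AcroExch.PDTextSelect
    Dim result As String
    Dim i As Long

    Set pdPage = PDDoc.AcquirePage(pageNum)
    Set pageSize = pdPage.GetSize

    ' Assumption: the page origin is the lower-left corner at 0,0
    Set pageRect = CreateObject("AcroExch.Rect")
    pageRect.Left = 0
    pageRect.bottom = 0
    pageRect.Right = pageSize.x
    pageRect.Top = pageSize.y

    Set sel = PDDoc.CreateTextSelect(pageNum, pageRect)
    If Not sel Is Nothing Then
        ' Text elements are indexed from 0 to GetNumText - 1
        For i = 0 To sel.GetNumText - 1
            result = result & sel.GetText(i)
        Next i
    End If

    FullPageText = result
End Function
```

If this works, the two inner loops could collapse into a single call per page, for example Print #1, CleanText(FullPageText(PDDoc, ii)).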
