Planet PDF Forum Archive



Topic: How can I retrieve a PDF file's words quickly?
Conf: (P-PDF) Developers, Msg: 69209
From: kww731029
Date: 7/24/2002 06:50 PM

Hello:

I wrote a program using VB 6.0 and Acrobat SDK 1.3, but reading the words from a PDF file is very slow. I have a large set of PDF files (about 1000 files, about 1.7 GB in total), so extracting the words is very painful. Could you please help me as soon as possible? Thanks in advance!

My code is shown below:

Public Function getPDFContent(ByVal pdfFileName As String) As String()
    'this method gets every non-empty word in a certain pdf file,
    'fills these words into a String array and returns that array

    Dim pdDoc As Acrobat.CAcroPDDoc
    Dim jso As Object
    Dim pdfContent() As String
    Dim pageCount As Long, pageWordCount As Long
    Dim I As Long, J As Long, K As Long
    Dim word As String

    'form pdf doc object
    Set pdDoc = CreateObject("AcroExch.PDDoc")

    'see if the file can be opened
    If pdDoc.Open(pdfFileName) Then
        Set jso = pdDoc.GetJSObject
        pageCount = pdDoc.GetNumPages
        ReDim pdfContent(0 To 1023)

        'for each page
        For J = 0 To pageCount - 1
            pageWordCount = jso.getPageNumWords(J)
            'for each word
            For I = 0 To pageWordCount - 1
                'fetch the word once instead of twice per iteration;
                '& "" turns a Null result into an empty string
                word = Trim(jso.getPageNthWord(J, I) & "")
                If Len(word) > 0 Then
                    'double the buffer when it is full
                    If K > UBound(pdfContent) Then
                        ReDim Preserve pdfContent(0 To 2 * K - 1)
                    End If
                    pdfContent(K) = word
                    K = K + 1
                End If
            Next I
        Next J

        Set jso = Nothing
        pdDoc.Close
    'if file can not be opened
    Else
        Debug.Print "could not open pdf file " & pdfFileName
    End If

    Set pdDoc = Nothing

    'shrink the array to the K words found;
    'keep a single empty element if no word was found
    If K > 0 Then
        ReDim Preserve pdfContent(0 To K - 1)
    Else
        ReDim pdfContent(0 To 0)
    End If

    getPDFContent = pdfContent
End Function

