Planet PDF Forum Archive

Topic: Re: open source PDF parser
Conf: (P-PDF) Developers, Msg: 57880
From: LeonardR
Date: 5/29/2002 05:28 PM

At 08:19 AM 5/11/2001 -0400, p-pdf-developer Listmanager wrote:
>I have been lamenting something. I think that it would be a good move for
>PDF as a whole if there were a freely licensable library (probably open
>source) for reading a PDF into memory.

Xpdf - <http://www.foolabs.com/xpdf>. Xpdf is written in
portable C++ and can be commercially licensed as well as being
available under the GPL.

I also saw a reference recently to a Lex/Yacc-based PDF parser
called PandaLex - see
<http://www.stillhq.com/cgi-bin/getpage?area=pandalex&page=index.htm>


>I see three potential layers: a layer that can find each object in a file
>like the CosObj layer in Acrobat, a layer that can understand document
>structure like Acrobat's PD layer, and a layer that can parse page
>contents into drawing objects.

All available in Xpdf.

HOWEVER, that will ONLY get you reading and parsing. NONE of the
open source parsers (Xpdf, Ghostscript, Panda, etc.) support rewriting that
information back into a file. So if your only goal is to read the PDF
document into memory for some reason, these can/will certainly meet your
needs - but if you also wish to reuse that content, you've got a LOT more
work ahead of you.
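
To make the first of the three layers quoted above a bit more concrete
(finding each object in a file), here is a minimal Python sketch. It is
not how Xpdf or Acrobat do it: it naively scans the bytes for
"N G obj ... endobj" spans instead of following the cross-reference
table, and it will be fooled by stream data that happens to contain
"endobj". The names find_objects and SAMPLE are made up for
illustration.

```python
import re

# Naive object-layer sketch: locate "N G obj ... endobj" spans by scanning.
# Assumption: no stream data contains the literal keyword "endobj".
# A real parser would read the xref table and trailer instead.
OBJ_RE = re.compile(rb"(\d+)\s+(\d+)\s+obj\b(.*?)\bendobj", re.DOTALL)


def find_objects(data: bytes) -> dict:
    """Map (object number, generation number) -> raw object body bytes."""
    objects = {}
    for match in OBJ_RE.finditer(data):
        key = (int(match.group(1)), int(match.group(2)))
        objects[key] = match.group(3).strip()
    return objects


# Hand-written fragment for illustration only; not a complete, valid PDF.
SAMPLE = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog /Pages 2 0 R >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages /Kids [] /Count 0 >>\nendobj\n"
)

if __name__ == "__main__":
    for (num, gen), body in sorted(find_objects(SAMPLE).items()):
        print(num, gen, body.decode("latin-1"))
```

The document-structure layer would then start from the object whose
dictionary has /Type /Catalog and follow references such as "2 0 R";
that, and everything about writing objects back out, is left out here.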


>I know, I know: GhostScript. But this is not freely licensable.

GhostScript is most certainly licensable - see Artifex
(<http://www.artifex.com>). HOWEVER, it is not a native PDF
parser. Instead it is a PostScript parser that executes PS code to
"translate" a PDF into a PS stream that the internal PS parser can
handle. This is just fine for rasterizing PDF, but isn't suitable for
general PDF parsing needs.


>Although Acrobat will likely remain the gold standard,

Only in that it (in theory) is the implementation closest to the
spec. However, in terms of functionality, robustness, portability, etc. it
leaves a LOT to be desired!


>I see many different PDF parsers causing a group of "safe" pdf files that
>everyone can read and a group of "borderline" pdf files that cause a few
>of our libraries to break.

That's already the case today: as more and more PDF generators
appear, they do things a little bit differently (some more or less
according to spec), and that forces those of us with PDF parsers to
tweak accordingly.


Leonard

--------------------------------------------------------------------------
Leonard Rosenthol <mailto:leonardr@appligent.com>
Director of Software Development                 (215) 922-3509 (voice)
Appligent, Inc. (formerly Digital Applications)  (610) 284-4233 (fax)

PGP Fingerprint: 8CC9 8878 921E C627 0BC1 15BB FC19 64A9 0016 1397


