Planet PDF Forum Archive

Topic: open source PDF parser
Conf: (P-PDF) Developers, Msg: 55081
From: DanielAri
Date: 5/29/2002 05:10 PM

I have been lamenting something. I think it would be a good move for PDF as a whole if there were a freely licensable library (probably open source) for reading a PDF into memory. I see three potential layers: a layer that can find each object in a file, like the CosObj layer in Acrobat; a layer that can understand document structure, like Acrobat's PD layer; and a layer that can parse page contents into drawing objects.

I am lamenting that by now, all the people who would possibly contribute to this project (including yours truly) have developed their own PDF parser.

I get the sense that there are a few people (again, including yours truly) who have written, or are working on, code to parse the drawing stream into easily manipulated objects.
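
To make "parse the drawing stream into easily manipulated objects" concrete, here is a minimal sketch in Python. It is not taken from any existing library: the names `DrawOp` and `parse_content_stream` are hypothetical, and it assumes an already decompressed content stream. It handles only numbers, names, simple literal strings, arrays, and operators. Hex strings, dictionaries, inline images, comments, nested parentheses, and the keywords true/false/null are deliberately left out.

```python
import re
from dataclasses import dataclass
from typing import List, Union

Operand = Union[int, float, str, list]


@dataclass
class DrawOp:
    """One content stream instruction: an operator plus the operands before it."""
    operator: str            # e.g. "m", "l", "re", "Tj"
    operands: List[Operand]


# One token per match. Bytes that match nothing (whitespace, unsupported
# syntax) are silently skipped, which is acceptable only for a sketch.
_TOKEN = re.compile(rb"""
    \((?:\\.|[^\\()])*\)        # literal string, no nested parentheses
  | /[^\s/\[\]()<>]*            # name such as /F1
  | [+-]?(?:\d+\.?\d*|\.\d+)    # integer or real number
  | [\[\]]                      # array delimiters
  | [A-Za-z'"*]+[0-9]?          # operator such as m, l, re, Tj, f*, d0
""", re.VERBOSE)


def parse_content_stream(data: bytes) -> List[DrawOp]:
    """Group operands with the operator that follows them (PDF is postfix)."""
    ops: List[DrawOp] = []
    stack: List[list] = [[]]    # stack[0] holds operands; deeper levels are open arrays
    for match in _TOKEN.finditer(data):
        tok = match.group()
        if tok == b"[":
            stack.append([])
        elif tok == b"]":
            if len(stack) > 1:
                finished = stack.pop()
                stack[-1].append(finished)
        elif tok[:1] == b"(":
            # Escape sequences are kept as written, not decoded.
            stack[-1].append(tok[1:-1].decode("latin-1"))
        elif tok[:1] == b"/":
            stack[-1].append(tok.decode("latin-1"))
        elif tok[:1] in b"+-.0123456789":
            text = tok.decode("ascii")
            stack[-1].append(float(text) if "." in text else int(text))
        else:
            # An operator closes the current instruction.
            ops.append(DrawOp(tok.decode("ascii"), stack[0]))
            stack = [[]]
    return ops
```

Calling `parse_content_stream(b"10 20 m 30.5 40 l S")` yields a list of three `DrawOp` objects (moveto, lineto, stroke) that calling code can inspect or rewrite without touching the raw bytes. A shared library would need far more than this, but the shape of the output is the point.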

The Acrobat API code certainly has bugs, but more importantly, it is available only inside Acrobat, is not thread safe, and more.

I know, I know: GhostScript. But this is not freely licensable. Although it is open source, for commercial applications it ultimately is just another proprietary solution that you can license, just like SPDF and the PDFLib import stuff.

The PDF market started small, and everyone (all three of us) could swing their arms wildly in circles without bumping anyone else. I think we have already hit the point where there are enough entrants into the market that offering a library that will bring more entrants would do more good than harm.

Although Acrobat will likely remain the gold standard, I see many different PDF parsers leading to a group of "safe" PDF files that everyone can read and a group of "borderline" PDF files that cause a few of our libraries to break. PostScript became this way, with some people being "more" compatible than others, and Adobe PostScript wasn't even the gold standard on the high end for quite some time.

There are developers who create PDF parsing libraries, and these libraries alone are their added value. There are developers who use already existing libraries (or the Acrobat API through a plug-in), and their added value is the code that sits on top of the API. Some developers do both.

I don't expect that the developers in the first group would have any interest in a shared page stream parser. I suspect that there are developers in the second category who have had to develop a bare-bones parser just to have something to build on top of. I fall into this group, and as I start to ponder new projects, I realize that a more robust parser is in order. I know that other people have expressed dissatisfaction with bugs in the Acrobat parser, and so I think that there might be other people who either have no parser or a less than complete parser who may be interested in jointly developing a better one.

I think of ZLib. Many people need compression services, and ZLib lets us focus on the core of the project where we really add value instead of letting each of us become a partial expert on compression.

Similarly, a PDF page content stream parser would let us each focus on what we know best. Could someone use such a library to compete with products that each of us currently offers? You bet. But competition is already arriving for those products, and I would rather see each of us working on a new project than see everybody individually becoming a PDF content parser developer.

I am thinking out loud. Maybe not everyone agrees with my evaluation of the competitive landscape. Maybe those who could most contribute to such a project already have their own parser that is its own added value. Maybe the commercial libraries already offer a complete solution, making a new parser of interest only to grad students who consider time far less valuable than money.

Comments? Volunteers?

Regards,
Dan
