Topic: open source PDF parser
Conf: (P-PDF) Developers, Msg: 55081
Date: 5/29/2002 05:10 PM
I have been lamenting something. I think it would be a good move for PDF as a whole if there were a freely licensable library (probably open source) for reading a PDF into memory. I see three potential layers: a layer that can find each object in a file, like the CosObj layer in Acrobat; a layer that can understand document structure, like Acrobat's PD layer; and a layer that can parse page contents into drawing objects.
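To make the lowest layer concrete, here is a toy sketch in Python (the names are my own invention, not any shipping API) of a COS-style parser that handles only names, numbers, arrays, and dictionaries; no strings, streams, indirect references, or cross-reference tables:

```python
import re

# Toy sketch of the lowest "COS-like" layer: tokenize a fragment of PDF
# object syntax and build Python values from it. Handles dictionaries,
# arrays, name objects, and numbers only.
TOKEN = re.compile(r"<<|>>|\[|\]|/[^\s/\[\]<>]+|[-+]?\d*\.?\d+")

def parse_object(tokens, pos=0):
    """Recursively build one object starting at tokens[pos].
    Returns (value, next_pos)."""
    tok = tokens[pos]
    if tok == "<<":                       # dictionary: /Key value pairs
        d, pos = {}, pos + 1
        while tokens[pos] != ">>":
            key = tokens[pos][1:]         # strip leading slash from the name
            value, pos = parse_object(tokens, pos + 1)
            d[key] = value
        return d, pos + 1
    if tok == "[":                        # array: values until "]"
        a, pos = [], pos + 1
        while tokens[pos] != "]":
            value, pos = parse_object(tokens, pos)
            a.append(value)
        return a, pos + 1
    if tok.startswith("/"):               # name object
        return tok[1:], pos + 1
    return (float(tok) if "." in tok else int(tok)), pos + 1  # number

def parse(source):
    value, _ = parse_object(TOKEN.findall(source))
    return value
```

So `parse("<< /Type /Page /MediaBox [0 0 612 792] >>")` yields an ordinary Python dictionary. The real object layer in any serious library also has to deal with strings, streams, indirect references, and the xref table; this only shows the shape of the recursion.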
I am lamenting that by now, all the people who would possibly contribute to this project (including yours truly) have developed their own PDF parser.
I get the sense that there are a few people (again, including yours truly) who have or are working on code to parse the drawing stream into easily manipulated objects.
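The core job of that third layer is splitting the content stream into operator/operand pairs, which are the raw material for any "easily manipulated objects" built on top. A toy sketch (hypothetical function name; real streams also contain strings, names, and inline images that a simple whitespace split cannot handle):

```python
# Toy sketch: split a page content stream into (operator, operands)
# pairs. Numeric tokens accumulate as operands; any non-numeric token
# is treated as the operator that consumes them.
def parse_content_stream(stream):
    ops, operands = [], []
    for tok in stream.split():
        try:
            operands.append(float(tok) if "." in tok else int(tok))
        except ValueError:
            ops.append((tok, operands))   # token is an operator
            operands = []
    return ops
```

Feeding it a path-drawing fragment like `"72 720 m 300 400 l S"` produces `[("m", [72, 720]), ("l", [300, 400]), ("S", [])]`, which a higher layer could then turn into moveto/lineto/stroke objects.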
The Acrobat API code certainly has bugs, but more importantly, it is available only inside Acrobat, is not thread safe, and more.
I know, I know: GhostScript. But this is not freely licensable. Although it is open source, for commercial applications it ultimately is just another proprietary solution that you can license, just like SPDF and the PDFLib import stuff.
The PDF market started small, and everyone (all three of us) could swing their arms wildly in circles without bumping anyone else. I think we have already hit the point where there are enough entrants into the market that offering a library that will bring more entrants would do more good than harm.
Although Acrobat will likely remain the gold standard, I see many different PDF parsers splitting the world into a group of "safe" PDF files that everyone can read and a group of "borderline" PDF files that cause a few of our libraries to break. PostScript went this way, with some implementations being "more" compatible than others, and Adobe PostScript wasn't even the gold standard on the high end for quite some time.
There are developers who create PDF parsing libraries, and these libraries alone are their added value. There are developers who use existing libraries (or the Acrobat API through a plug-in), and their added value is the code that sits on top of the API. Some developers do both.
I don't expect that the developers in the first group would have any interest in a shared page stream parser. I suspect that there are developers in the second category who have had to develop a bare-bones parser just to have something to build on top of. I fall into this group, and as I start to ponder new projects, I realize that a more robust parser is in order. I know that other people have expressed dissatisfaction with bugs in the Acrobat parser, so I think there might be other people, with either no parser or a less-than-complete one, who may be interested in jointly developing a better one.
I think of ZLib. Many people need compression services, and ZLib lets us focus on the core of the project where we really add value instead of letting each of us become a partial expert on compression.
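The ZLib case is especially relevant here, since PDF's FlateDecode filter is ordinary zlib deflate. Through Python's standard bindings, for instance, round-tripping a compressed content stream takes two calls, and nobody had to become a compression expert to write them:

```python
import zlib

# A content stream compressed with FlateDecode is plain zlib data;
# compress and decompress are one call each.
content = b"BT /F1 12 Tf 72 712 Td (Hello) Tj ET"
compressed = zlib.compress(content)
restored = zlib.decompress(compressed)
assert restored == content
```

A shared page content parser would ideally sit at the same level of "just call it and move on."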
Similarly, a PDF page content stream parser would let us each focus on what we know best. Could someone use such a library to compete with products that each of us currently offers? You bet. But competition is already arriving for those products, and I would rather see each of us working on new projects than see everybody individually becoming a PDF content parser developer.
I am thinking out loud. Maybe not everyone agrees with my evaluation of the competitive landscape. Maybe those who could most contribute to such a project already have their own parser that is its own added value. Maybe the commercial libraries already offer a complete solution, making a new parser of interest only to grad students who consider time far less valuable than money.