Topic: why deleting objects does not clear them
Conf: (P-PDF) Developers, Msg: 59906
Date: 5/29/2002 05:42 PM
Pete, you seem to have discovered the bottom line - that closing and
re-creating the doc is necessary. I will take a stab at _why_ this is
necessary for those who are curious.
Here is my guess as to what Acrobat is doing. I do not work for Adobe and
have seen no code for any internal part of Acrobat, but I know the PDF file
format pretty well.
When you create a new file, Acrobat creates some structures, including a
table that lists the location of all indirect objects in that PDF file -
much like the xref table in a PDF file, but with additional fields that
indicate whether the object is in the file, in a temp file, or in
memory. Don't be fooled; even a PDF file with no pages still has a few
objects such as the root dictionary and a few others that must exist. So I
bet that Acrobat creates these objects as well.
This is why creating a new document takes a little more time than calling
malloc once for a memory structure.
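
To make the guess concrete, here is a minimal Python sketch of what such a
table might look like. Every name in it (Location, TableEntry,
new_document_table) is made up for illustration, and the choice of exactly
two starting objects is an assumption; none of this is Acrobat's real
internal structure or API.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Location(Enum):
    """Where the current copy of an indirect object lives (assumed states)."""
    FILE = "file"
    TEMP_FILE = "temp file"
    MEMORY = "memory"


@dataclass
class TableEntry:
    """One row of the hypothetical object table, similar to an xref entry."""
    obj_num: int
    generation: int
    location: Location
    offset: Optional[int] = None  # byte offset, only meaningful on disk


def new_document_table():
    """Entries a brand-new, zero-page document might start with (assumed)."""
    return {
        1: TableEntry(1, 0, Location.MEMORY),  # root dictionary (catalog)
        2: TableEntry(2, 0, Location.MEMORY),  # pages tree root, no kids yet
    }


table = new_document_table()
```

Even the empty document needs several allocations and table rows, which
fits the point that creating one costs more than a single malloc.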
Now, you copy your pages from the other doc. This creates several indirect
objects for the page, links, bookmarks, fonts (hopefully not making
duplicate copies if they are already there), and more. Acrobat can find
every one of these indirect objects because they are referenced in that table.
Here is where it gets interesting and slightly more conjectural. When you
delete those pages, Acrobat surely removes references to them from the
pages tree. However even though the indirect objects for the page (page
dict and contents stream and more) are no longer referenced by the pages
tree, they may still be listed in the table that lists all indirect
objects. Thus, when the file is written, Acrobat steps through the table
and all indirect objects get written to disk. Acrobat does not realize
that some of those objects are not referenced anywhere in the document
structure (other than the xref table), leaving you with bloat. Garbage
collection probably gets rid of some of this.
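
The difference between "write everything in the table" and "write only
what the document still references" can be shown with a toy model. This is
a sketch of the conjecture above, not Acrobat's actual save logic: the
object numbers, the reference graph, and the helper names (reachable,
delete_page) are all invented for the example.

```python
def reachable(objects, root):
    """Return the object numbers reachable from root by following references."""
    seen = set()
    stack = [root]
    while stack:
        num = stack.pop()
        if num in seen or num not in objects:
            continue
        seen.add(num)
        stack.extend(objects[num])
    return seen


def delete_page(objects, pages_tree, page):
    """Drop the page from the pages tree, but leave its table entries alone."""
    objects[pages_tree] = [k for k in objects[pages_tree] if k != page]


# Toy object table: object number -> object numbers it references.
objects = {
    1: [2],     # root dictionary -> pages tree
    2: [3],     # pages tree -> page
    3: [4, 5],  # page dict -> contents stream, font
    4: [],      # contents stream
    5: [],      # font
}

delete_page(objects, 2, 3)

# A naive save walks the whole table, orphans included.
written_naively = sorted(objects)

# A garbage-collecting save keeps only what the root can still reach.
written_after_gc = sorted(reachable(objects, 1))

orphans = sorted(set(objects) - reachable(objects, 1))
```

In this model the naive save still writes objects 3, 4 and 5 even though
nothing in the document structure points at them any more, which is the
bloat described above.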
If you are curious, you could make two files: one with fresh pages in it,
and another in which some other pages are inserted and deleted before
inserting the same pages that are in the first file. Read through both
files and see what objects exist in the second file that do not exist in
the first.
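
Reading through the files by hand gets tedious, so here is a rough Python
sketch that lists the "N G obj" headers found in each file and reports the
ones that appear only in the second. The function names are made up for
this example. It is a plain text scan with known limits: binary stream
data can produce false matches, objects packed inside compressed object
streams will not be seen, and the two files need not number the same
object identically, so treat the result as a starting point for reading
rather than an exact diff.

```python
import re

# Matches indirect object headers such as "12 0 obj"; "endobj" does not match.
OBJ_HEADER = re.compile(rb"(\d+)\s+(\d+)\s+obj\b")


def object_ids(pdf_bytes):
    """Return the set of (object number, generation) headers found in the bytes."""
    return {(int(num), int(gen)) for num, gen in OBJ_HEADER.findall(pdf_bytes)}


def extra_objects(fresh_bytes, churned_bytes):
    """Return the ids present in the churned file but not in the fresh one."""
    return sorted(object_ids(churned_bytes) - object_ids(fresh_bytes))


# Tiny stand-ins for the two files, just to show the idea.
fresh = (
    b"%PDF-1.4\n"
    b"1 0 obj\n<< /Type /Catalog >>\nendobj\n"
    b"2 0 obj\n<< /Type /Pages >>\nendobj\n"
)
churned = fresh + b"3 0 obj\n<< /Type /Page >>\nendobj\n"

leftovers = extra_objects(fresh, churned)
```

With real files you would read each one in binary mode, for example
open(path, "rb").read(), and pass the bytes to extra_objects. Comparing
len(object_ids(...)) for the two files is also a quick first check.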