PDFlib has announced a new version of its PDF content extraction engine. The latest edition improves page content analysis, supports right-to-left languages such as Arabic and Hebrew, and offers advanced Unicode post-processing controls.
The engine updates ship in the PDFlib TET (Text Extraction Toolkit) family of products: PDFlib TET 4, PDFlib TET PDF IFilter 4, and TET Plugin 4. Text extraction results benefit from improved shadow removal, word boundary detection, and de-hyphenation, along with superscript and subscript detection. Additional workarounds for non-conforming PDF documents make text extraction more robust, and the enhanced repair mode can recover text from damaged PDFs.
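To illustrate what de-hyphenation means in practice, here is a minimal Python sketch that rejoins words split across line breaks. It is a naive heuristic written for this article, not TET's algorithm or API; the function name `dehyphenate` is hypothetical, and a real engine would also have to decide when a trailing hyphen belongs to the word.

```python
def dehyphenate(lines: list[str]) -> str:
    """Join lines of text, merging words that were hyphenated at a line end.

    Naive assumption: a token ending in '-' directly after a letter is the
    first half of a split word. This is an illustration, not TET's method.
    """
    words: list[str] = []
    pending = ""  # word fragment carried over from a line ending in a hyphen
    for line in lines:
        tokens = line.split()
        if not tokens:
            continue  # blank line: keep any pending fragment for the next line
        if pending:
            tokens[0] = pending + tokens[0]
            pending = ""
        last = tokens[-1]
        if last.endswith("-") and len(last) > 1 and last[-2].isalpha():
            pending = tokens.pop()[:-1]  # drop the hyphen, hold the fragment
        words.extend(tokens)
    if pending:
        words.append(pending + "-")  # nothing followed: restore the hyphen
    return " ".join(words)


joined = dehyphenate(["text extrac-", "tion from PDF"])
```

The sketch would wrongly merge genuinely hyphenated compounds that happen to break at a line end, which is one reason production tools rely on more than a trailing-hyphen check.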
TET 4 rearranges bidirectional text in Arabic or Hebrew documents into the proper logical order. Unicode post-processing controls offer folding, decomposition, and normalization according to the Unicode standard, which lets an application adjust the extracted text to its own requirements.
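For readers unfamiliar with these Unicode operations, the following sketch shows decomposition, normalization, and case folding using Python's standard `unicodedata` module. It does not use the TET API, and the helper name `postprocess` is hypothetical; it only demonstrates the kind of transformation such controls apply to extracted text.

```python
import unicodedata


def postprocess(text: str, form: str = "NFC", fold_case: bool = False) -> str:
    """Normalize text to a Unicode normalization form, optionally case-folding.

    `form` is one of "NFC", "NFD", "NFKC", "NFKD". Illustration only; this is
    not how TET exposes its post-processing options.
    """
    normalized = unicodedata.normalize(form, text)
    return normalized.casefold() if fold_case else normalized


ligature = "\ufb01le"   # "file" spelled with U+FB01 LATIN SMALL LIGATURE FI
accented = "caf\u00e9"  # "cafe" with precomposed U+00E9

# Compatibility normalization expands the ligature into plain "f" + "i".
expanded = postprocess(ligature, "NFKC")

# Canonical decomposition splits U+00E9 into "e" + U+0301 COMBINING ACUTE.
decomposed = postprocess(accented, "NFD")

# Case folding maps text to a form suited to caseless comparison.
folded = postprocess("Stra\u00dfe", "NFC", fold_case=True)
```

Which form fits depends on the consumer: a search index typically wants compatibility normalization plus folding so that ligatures and case differences do not block matches, while an archival pipeline may prefer NFC to stay close to the original text.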
TET is also available as a free plugin for Adobe Acrobat. The plugin supports Unicode syntax for search text and can highlight search hits on a page. PDFlib also offers the TET Cookbook, a collection of programming examples that demonstrate the use of TET for text and image extraction tasks.