PDF Compression Job

A colleague recently told me that a client of his had performed a market survey to find a product that would compress the PDF files stored in their Content Server. Their concern was the amount of storage space consumed by uncompressed PDFs. (This client received PDF files from a variety of sources and was unable to control the compression of the PDFs at their sources, so it was looking for an after-the-fact solution to compress them.) There are a number of products on the market that compress PDF files (e.g., CVISION’s PdfCompressor seems to be a good one), but none interface natively with Documentum to do it. Intrigued by this idea, and dissatisfied with having to buy a solution, I set out to determine if I could implement a PDF compression solution using an open source product like iText.

First I created a Documentum job that periodically scanned for PDF files that fell within a date range*. The job exported each PDF it found and tried to compress it using iText. If the compression succeeded, the compressed PDF was checked back into the repository as the same version. If not, the checkout was simply cancelled. The most interesting part of the code is the PDF compression code using iText. It is remarkably simple and does an adequate job on all the PDFs I tested**. Here is the compression code snippet.

  // imports this fragment relies on (assuming iText 5 package names)
  import java.io.File;
  import java.io.FileOutputStream;
  import com.itextpdf.text.Document;
  import com.itextpdf.text.pdf.PdfReader;
  import com.itextpdf.text.pdf.PdfStamper;

  // capture starting file size
  File file = new File(origFile);
  origFileSize = file.length();

  // borrowed from http://itextpdf.com/examples/iia.php?id=218
  PdfReader reader = new PdfReader(origFile);
  if (!reader.isEncrypted()) {
    PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(compFile));

    // Document.compress is a static, library-wide iText flag;
    // true makes iText compress the content streams it writes
    Document.compress = true;

    // re-set each page's content so it is rewritten (and thus compressed) on save
    int total = reader.getNumberOfPages();
    for (int i = 1; i <= total; i++) {
      reader.setPageContent(i, reader.getPageContent(i));
    }

    // close the stamper to write the new PDF, then close the reader
    stamper.close();
    reader.close();

    // get compressed file size
    file = new File(compFile);
    compFileSize = file.length();

    // report success only if the file actually got smaller
    return origFileSize > compFileSize;
  }
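Why does simply re-saving page content shrink the file? With compression switched on, iText writes content streams through the Flate (deflate) filter, the same algorithm the JDK exposes in java.util.zip. The sketch below is only an illustration of that effect using the standard library, not iText itself: the class name FlateDemo and the sample content stream are made up for the example.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

// Hypothetical demo class; it does not use iText.
public class FlateDemo {

    // Deflate a byte array at the given level (0-9), as a Flate filter would.
    public static byte[] deflate(byte[] input, int level) {
        Deflater deflater = new Deflater(level);
        deflater.setInput(input);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    // Inflate deflated bytes back to the original content.
    public static byte[] inflate(byte[] input) {
        Inflater inflater = new Inflater();
        inflater.setInput(input);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                if (n == 0 && inflater.needsInput()) {
                    break; // input ended early; stop instead of looping forever
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalStateException("not valid deflate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }

    // A made-up, repetitive page content stream (200 lines of text operators).
    public static byte[] sampleContentStream() {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("BT /F1 12 Tf 72 ").append(720 - i)
              .append(" Td (Line ").append(i).append(") Tj ET\n");
        }
        return sb.toString().getBytes(StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        byte[] raw = sampleContentStream();
        byte[] packed = deflate(raw, Deflater.BEST_COMPRESSION);
        System.out.println(raw.length + " ==> " + packed.length);
    }
}
```

Text-heavy content streams like this one deflate well. Streams that are already compressed (most embedded images, for instance) gain little, which fits the wide spread of results in the log.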

Like I said, the iText snippet does a decent job compressing PDF files, achieving anywhere from a 0.01% to a 35% reduction in size (see log excerpt below). It could probably achieve much more if I wrote code to examine each element of the PDF file separately and selected the optimum compression algorithm for each element (e.g., color images, scanned images, formatted text). I suspect that is what CVISION’s PdfCompressor does. In addition, PdfCompressor can perform OCR on the PDF as part of the compression process. Perhaps I’ll try that next...

FILENAME : ORIGINAL SIZE ==> COMPRESSED SIZE : PERCENT COMPRESSED
CMIS-v1.1-cos01.pdf : 1340961 not compressed
Content_Server_53_SP6_Release_Notes.pdf : 1826510 ==> 1550075 : 15.13% compressed
doc1.comp.pdf : 2231903 ==> 2231740 : 0.01% compressed
doc2.comp.pdf : 515678 not compressed
doc2.pdf : 515653 not compressed
doc3.comp.pdf : 132544 not compressed
doc3.pdf : 151254 ==> 132544 : 12.37% compressed
docu32924_Documentum-Content-Server-6.7-DQL-Reference.pdf : 1798106 encrypted files cannot be compressed
docu32927_Documentum-Content-Server-6.7-System-Object-Reference.pdf : 3450245 ==> 2231903 : 35.31% compressed
docu32929_Documentum-Content-Server-6.7-Administration-and-Configuration-Guide.pdf : 5381640 encrypted files cannot be compressed
Memo.pdf : 1032281 ==> 913732 : 11.48% compressed
Memo2.pdf : 730887 ==> 697638 : 4.55% compressed
Note.pdf : 419827 ==> 390566 : 6.97% compressed
Memo3.pdf : 892248 ==> 857271 : 3.92% compressed
Response.pdf : 148988 ==> 137494 : 7.71% compressed
Document.pdf : 1278332 ==> 1112985 : 12.93% compressed
Memo2.pdf : 323929 ==> 316612 : 2.26% compressed
Memo3.pdf : 476955 ==> 410189 : 14.00% compressed
Note2.pdf : 273916 ==> 235906 : 13.88% compressed
Memo4.pdf : 9173928 ==> 7594308 : 17.22% compressed
Memo5.pdf : 3500990 ==> 2975918 : 15.00% compressed
Note3.pdf : 220409 ==> 202745 : 8.01% compressed
Memo6.pdf : 57121 ==> 53212 : 6.84% compressed
Letter.pdf : 73766 ==> 67350 : 8.70% compressed
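The percentages in the log are the size reduction relative to the original file size. Here is a minimal sketch of how such a line could be produced. The class and method names (CompressionLog, percentCompressed, logLine) are hypothetical and are not taken from the actual job code.

```java
import java.util.Locale;

// Hypothetical helper that reproduces the log line format shown above.
public class CompressionLog {

    // Percent reduction relative to the original size.
    public static double percentCompressed(long origSize, long compSize) {
        return (origSize - compSize) * 100.0 / origSize;
    }

    // Format one log line; a file that did not shrink is reported as not compressed.
    public static String logLine(String name, long origSize, long compSize) {
        if (compSize >= origSize) {
            return name + " : " + origSize + " not compressed";
        }
        return String.format(Locale.ROOT, "%s : %d ==> %d : %.2f%% compressed",
                name, origSize, compSize, percentCompressed(origSize, compSize));
    }

    public static void main(String[] args) {
        // sizes taken from the doc3.pdf line of the log
        System.out.println(logLine("doc3.pdf", 151254, 132544));
        // prints: doc3.pdf : 151254 ==> 132544 : 12.37% compressed
    }
}
```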

There is obviously more to the code than just the snippet above. The entire Composer project for the PDFCompress job is here if you are interested. Your mileage may vary.

* I considered using an Aspect to flag files that had already been compressed to keep from duplicating effort every time the job ran.  I still might try that.
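For reference, the date-range scan this footnote refers to could be expressed as a DQL query along the following lines. This is only a sketch under assumptions: the post does not show the real query, so the choice of dm_document, the r_modify_date attribute, the 'pdf' format name, and the PdfScanQuery class are all guesses made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

// Hypothetical query builder; the job's real query is not shown in the post.
public class PdfScanQuery {

    // Formats dates to match the 'mm/dd/yyyy' DQL date pattern used below.
    private static final DateTimeFormatter DQL_DATE =
            DateTimeFormatter.ofPattern("MM/dd/yyyy");

    // Select PDFs modified from 'from' through the end of 'to' (both inclusive).
    public static String build(LocalDate from, LocalDate to) {
        return "select r_object_id from dm_document"
             + " where a_content_type = 'pdf'"
             + " and r_modify_date >= DATE('" + from.format(DQL_DATE) + "','mm/dd/yyyy')"
             + " and r_modify_date < DATE('" + to.plusDays(1).format(DQL_DATE) + "','mm/dd/yyyy')";
    }

    public static void main(String[] args) {
        System.out.println(build(LocalDate.of(2013, 1, 1), LocalDate.of(2013, 1, 31)));
    }
}
```

A date range alone means a file is picked up again on every run that covers it, which is the duplicated effort that a flag on already-compressed files would avoid.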

** I tested on a corpus of 104 files from a variety of sources.  These files included: Word files (with and without images) saved as PDFs; scanned documents (with and without OCR) converted to PDF by Captiva InputAccel; PDF files of unknown origin from the Internet; Documentum documentation; and locked PDFs.

About Scott
I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.
