Leveraging InputAccel for OCR

A few months back I chronicled my first experience with Captiva InputAccel development. This week, I’d like to supplement that with another experience I had recently. I have a standalone Java/DFC application that extracts TIF files from a Docbase, merges them into a PDF file using iText, and creates a simple HTML index of all the PDF files created. Recently, the customer asked me to create OCRed PDFs so they could search for words and phrases in each file. I briefly surveyed Google for open source (and proprietary) Java libraries that would OCR TIF or PDF files and wasn’t happy with what I found. Then it occurred to me that Captiva’s NuanceOCR module did exactly what I wanted, and my customer already owned it. It was just a matter of leveraging that existing capability from my Java application.

Long story short: I set up MultiDirectoryWatch (MDW) to watch a folder where my Java application copied PDF files after they were merged. MDW kicked off a batch in InputAccel which performed the OCR and deposited the OCRed file in an output directory watched by my Java application. When the output file arrived, the Java application copied it back to where it belonged.  Simple.
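The Java side of that hand-off can be sketched roughly as follows. This is a minimal illustration rather than the code from my application: the class name, method names, the ".part" staging suffix, and the polling approach are all assumptions, and it presumes MDW is configured to pick up only *.pdf files so that it ignores the partially copied file.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.time.Duration;
import java.util.Optional;

/** Hypothetical helper for handing PDFs to a watched folder and collecting the OCRed result. */
public class OcrHandoff {

    /**
     * Copies the merged PDF into the folder MDW watches. The file is written under a
     * temporary ".part" name first and then renamed, so the watcher (assumed to filter
     * on *.pdf) does not see a half-written file.
     */
    public static Path submit(Path pdf, Path watchDir) throws IOException {
        Path partial = watchDir.resolve(pdf.getFileName() + ".part");
        Path target = watchDir.resolve(pdf.getFileName().toString());
        Files.copy(pdf, partial, StandardCopyOption.REPLACE_EXISTING);
        // Rename within the same directory once the copy is complete.
        return Files.move(partial, target, StandardCopyOption.REPLACE_EXISTING);
    }

    /**
     * Polls the output directory until the named file exists and its size is unchanged
     * between two consecutive polls (a simple heuristic that the writer has finished).
     * Returns an empty Optional if the timeout elapses first.
     */
    public static Optional<Path> awaitResult(Path outputDir, String fileName,
                                             Duration timeout, Duration pollInterval)
            throws IOException, InterruptedException {
        Path candidate = outputDir.resolve(fileName);
        long start = System.nanoTime();
        long lastSize = -1;
        while (System.nanoTime() - start < timeout.toNanos()) {
            if (Files.exists(candidate)) {
                long size = Files.size(candidate);
                if (size > 0 && size == lastSize) {
                    return Optional.of(candidate);
                }
                lastSize = size;
            }
            Thread.sleep(pollInterval.toMillis());
        }
        return Optional.empty();
    }

    /** Moves the OCRed file back to where it belongs, replacing any existing copy. */
    public static Path collect(Path ocrResult, Path destination) throws IOException {
        return Files.move(ocrResult, destination, StandardCopyOption.REPLACE_EXISTING);
    }
}
```

A caller would invoke submit() after each merge, then awaitResult() with the same file name (the process writes the output under the original file name), and finally collect(). Polling is used here only because it is easy to reason about; java.nio.file.WatchService would be an alternative.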

The interesting part of this process, and why I thought it was blog-worthy, has to do with the short InputAccel process used to do the OCR. I had to include two modules and process steps that I found to be unintuitive. Here are the details of the process:

  • MultiDirectoryWatch
    • Level 0
    • Multi:0.Ready = 8
  • Multi
    • Level 1
    • ImageDivider:1.InputFile = MultiDirectoryWatch:1.OutputImage
    • ImageDivider:1.Ready = 1
  • ImageDivider
    • Level 1
    • NuanceOCR:0.Level0_InputImage = ImageDivider:0.OutputFile
  • NuanceOCR
    • Level 1
    • Format 1 = Adobe PDF with image on text
    • Save file to file system = true
    • File = @(MultiDirectoryWatch.OriginalFileName)
    • Overwrite file if it exists

The parts I found to be unintuitive were the Multi and the ImageDivider steps. It turns out that NuanceOCR (like a lot of other InputAccel modules) processes only one page at a time. So, when I had MDW pass it a PDF composed of numerous TIF pages, it processed only the first page. OK, so using ImageDivider became more obvious after that revelation. But Multi? It turns out that Multi is a utility module generally used to restructure the internal InputAccel tree (e.g., create or delete folders, documents, and pages). It is required for ImageDivider to do its thing.

So, if you ever need a fairly easy and painless way to quickly OCR files, a short InputAccel process like this one may be your answer. The trick is to use Multi and ImageDivider to prepare each page for the OCR module.

Question: Is there a way to access NuanceOCR directly and programmatically (i.e., via an API) without having to create an IA process?

About Scott
I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.

4 Responses to Leveraging InputAccel for OCR

  1. Yuri Simione says:

    Interesting post. NuanceOCR is an OEM library and I am not sure that one can use it directly. I am speaking just from a licensing point of view; technically, the library could be used without problems. The answer is in the license agreement that you and I don’t have time to read…

    Did you use DFC Operations to export/import files? If so, I think it could be interesting to share your code, which would complete your DFC Operations posts.

    Ciao from Rome, Italy.
    Yuri

  2. Jed Spink says:

    Spooky, I was just talking to a client on the phone re: doing just this, i.e. using Captiva to run OCR over existing image-only PDFs, and explaining how it could be done. And then you posted this.

    For another client, without Captiva, we developed a process using Adobe Workgroup Server, whereby the image-only PDF documents were queued, as per AutoRender/CTS. We pulled them down using a local process, let Adobe convert the PDFs to TIFFs, run OCR, and reconstitute the documents, and then we pushed them back into Documentum using the local process. As I say, it pretty much leverages the same methodology used by CTS.

    • Richard Rock says:

      Hi, we are using the AdobePDF.java method, which lists you as author, circa 2002. We recently upgraded InputAccel and discovered the code does not work as it used to. It does not correctly distinguish between searchable PDFs and image PDFs.

      Do you have an updated AdobePDF.java module we can use, or know where we can purchase one?

      Thanks for any assistance
      Richard Rock

  3. You will have to use EMC Captiva PIX tools ….
