Similarity Index Aspect v2

I recently tinkered with the solution contained in my EMC Knowledge Sharing Article, Finding Similar Documents Without Using a Full Text Index. With a simple addition to the Aspect code, I managed to eliminate one of the solution’s main shortcomings: that it only worked for crtext content. The solution to this problem came in the form of Apache Tika. (Thanks to Lee Grayson for introducing me to Tika.) Tika is a text-extraction engine originally incubated in the Lucene project. With the addition of the code below, the SIAspect class (see the article for full context) can now extract and compare content regardless of format. This small change has greatly increased SI’s effectiveness; I only wish I had discovered Tika before submitting the article.

// SIAspect.java, Line 129
// *** \/\/\/ changed in v2 \/\/\/ ***
	// string1 = cx.ByteArrayInputStreamToString(content);  // "old" method

	// use Tika to extract text from any content
	try {
		Tika tika = new Tika();
		format = tika.detect(content);
		// if tika detects the content is a binary stream, just convert the stream
		// to a string.  Sometimes text files, RTF and PDF files that have
		// extended ASCII chars will cause Tika to treat the file as a binary file
		// and not extract the content.
		if (format.contains("octet-stream") ||
				format.contains("image") ||
				format.contains("audio") ||
				format.contains("video")) {
			string1 = cx.ByteArrayInputStreamToString(content);
		} else {
			// use Tika to extract content
			string1 = tika.parseToString(content);
			string1 = string1.replaceAll("\\s+", " ");
		}
	} catch (Exception e) {
		// if Tika causes an error, just convert the stream to a string
		string1 = cx.ByteArrayInputStreamToString(content);
	}

// *** ^^^ changed in v2 ^^^ ***
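
The branching in the snippet above can be pulled out into two small helpers, which makes the decision easier to test without a repository or Tika on the classpath. This is only a sketch: the class name ExtractionPolicy and both method names are hypothetical and do not appear in SIAspect. The substring checks and the regular expression are copied from the snippet.

```java
// Hypothetical helper class; not part of the SIAspect code in the article.
final class ExtractionPolicy {

    private ExtractionPolicy() {
        // static helpers only
    }

    // Mirrors the four substring checks in the snippet: returns true when the
    // detected type should skip Tika and be converted to a string directly.
    static boolean shouldBypassTika(String detectedType) {
        return detectedType.contains("octet-stream")
                || detectedType.contains("image")
                || detectedType.contains("audio")
                || detectedType.contains("video");
    }

    // Same regex as the snippet: each run of whitespace (spaces, tabs,
    // line breaks) is collapsed into a single space.
    static String collapseWhitespace(String text) {
        return text.replaceAll("\\s+", " ");
    }
}
```

With helpers like these, the Tika branch would reduce to checking shouldBypassTika(format) and, when it returns false, passing the result of tika.parseToString(content) through collapseWhitespace.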

To test the improvement, I modified the test code to use JODConverter to generate a variety of document formats from my test corpus. The new project can be downloaded here.

 


About Scott
I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.
