Similarity Index Aspect v2
April 10, 2015
I recently tinkered with the solution contained in my EMC Knowledge Sharing Article, Finding Similar Documents Without Using a Full Text Index. With a simple addition to the Aspect code, I managed to eliminate one of the solution’s main shortcomings: it only worked for crtext content. The fix came in the form of Apache Tika. (Thanks to Lee Grayson for introducing me to Tika.) Tika is a text-extraction engine originally incubated in the Lucene project. With the code below added, the SIAspect class (see the article for full context) can now extract and compare content in any format Tika is able to parse, rather than plain text only. This small addition has increased SI’s effectiveness hugely; I only wish I had discovered Tika before submitting the article.
    // SIAspect.java, Line 129
    // (requires "import org.apache.tika.Tika;" at the top of the file)

    // *** \/\/\/ changed in v2 \/\/\/ ***
    // string1 = cx.ByteArrayInputStreamToString(content);  // "old" method

    // use Tika to extract text from any content
    try {
        Tika tika = new Tika();
        format = tika.detect(content);

        // If Tika detects that the content is a binary stream, just convert
        // the stream to a string. Sometimes text, RTF and PDF files that
        // contain extended ASCII chars cause Tika to treat the file as
        // binary and not extract the content.
        if (format.contains("octet-stream") || format.contains("image")
                || format.contains("audio") || format.contains("video")) {
            string1 = cx.ByteArrayInputStreamToString(content);
        } else {
            // use Tika to extract the content, then collapse runs of whitespace
            string1 = tika.parseToString(content);
            string1 = string1.replaceAll("\\s+", " ");
        }
    } catch (Exception e) {
        // if Tika causes an error, just convert the stream to a string
        string1 = cx.ByteArrayInputStreamToString(content);
    }
    // *** ^^^ changed in v2 ^^^ ***
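The two decisions buried in that snippet (when to bypass Tika's parser, and how the extracted text is normalized) can be pulled out and exercised without Tika or Documentum on the classpath. What follows is only a rough sketch of that idea, not code from the article or the downloadable project: the class name ExtractionPolicy and both method names are hypothetical, and treating a null MIME type as binary is my own assumption.

```java
/**
 * Hypothetical helper (not part of SIAspect) that isolates the MIME-type
 * check and whitespace cleanup used in the v2 change so they can be tested
 * without Tika on the classpath.
 */
final class ExtractionPolicy {

    // MIME-type fragments that the v2 snippet treats as "do not parse".
    private static final String[] RAW_FRAGMENTS = {
        "octet-stream", "image", "audio", "video"
    };

    /** Returns true when the detected type should skip Tika's parser. */
    static boolean shouldBypassParser(String mimeType) {
        if (mimeType == null) {
            return true; // assumption: an unknown type is handled like binary
        }
        for (String fragment : RAW_FRAGMENTS) {
            if (mimeType.contains(fragment)) {
                return true;
            }
        }
        return false;
    }

    /** Collapses every run of whitespace into a single space. */
    static String normalizeWhitespace(String text) {
        return text.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        System.out.println(shouldBypassParser("application/octet-stream"));
        System.out.println(shouldBypassParser("application/pdf"));
        System.out.println(normalizeWhitespace("similar \n\t documents"));
    }
}
```

In SIAspect the string passed to shouldBypassParser would be the value returned by tika.detect(content), and normalizeWhitespace would be applied to the result of tika.parseToString(content).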
To test the improvement, I modified the test code to use JODConverter to generate a variety of document formats from my test corpus. The new project can be downloaded here.