Similarity Index Aspect v2

I recently tinkered with the solution contained in my EMC Knowledge Sharing article, Finding Similar Documents Without Using a Full Text Index. With a simple addition to the Aspect code, I managed to eliminate one of the solution’s main shortcomings: it only worked for crtext content.  The solution to this problem came in the form of Apache Tika.  (Thanks to Lee Grayson for introducing me to Tika.)  Tika is a text-extraction engine originally incubated in the Lucene project.  With the simple addition of the code below, the SIAspect class (see the article for full context) can now extract and compare content regardless of format.  This small addition has hugely increased SI’s effectiveness; I only wish I had discovered Tika before submitting the article.

// SIAspect.java, Line 129
// *** \/\/\/ changed in v2 \/\/\/ ***
	// string1 = cx.ByteArrayInputStreamToString(content);  // "old" method

	// use Tika to extract text from any content
	try {
		Tika tika = new Tika();
		format = tika.detect(content);
		// if Tika detects that the content is a binary stream, just convert the
		// stream to a string.  Sometimes text, RTF, and PDF files that contain
		// extended ASCII chars will cause Tika to treat the file as binary
		// and not extract the content.
		if (format.contains("octet-stream") ||
				format.contains("image") ||
				format.contains("audio") ||
				format.contains("video")) {
			string1 = cx.ByteArrayInputStreamToString(content);
		} else {
			// use Tika to extract the content as plain text
			string1 = tika.parseToString(content);
			// collapse runs of whitespace to single spaces
			string1 = string1.replaceAll("\\s+", " ");
		}
	} catch (Exception e) {
		// if Tika throws an error, fall back to converting the stream to a string
		string1 = cx.ByteArrayInputStreamToString(content);
	}

// *** ^^^ changed in v2 ^^^ ***

To test the improvement, I modified the test code to use the JODConverter to generate a variety of document formats from my test corpus.  The new project can be downloaded here.

 


Similarity Index Post at Armedia

FYI and ICYMI – I have a blog post at Armedia recapping my EMC Proven Professional Knowledge Sharing article, Finding Similar Documents without Using a Full Text Index.

2014 Knowledge Sharing Article Published

As I mentioned in June, my 2014 EMC Proven Professionals Knowledge Sharing article, Finding Similar Documents Without Using a Full Text Index, has been published on the ECN.  Here is the direct link to the PDF.  Please read it, download the code, give it a try, and let me know what you think.

UPDATE:  With the kind consent of EMC, my 2014 Knowledge Sharing article is now available on my Publications page (or here directly) for those of you without an ECN account.

My cert for being a published author!


My 2014 Knowledge Sharing Abstract

For the past 8 years, EMC has held an annual Knowledge Sharing Competition among its Proven Professionals.  This past year, I entered the competition with my article, Finding Similar Documents In Documentum Without Using a Full Text Index.  Though I didn’t win, my article was chosen for publication.

Here is the article’s abstract:

This Knowledge Sharing article will discuss how to configure Documentum to enable identification of syntactically similar content without the use of a full text indexing engine. The technique described utilizes a Java Aspect to calculate SimHash values for content objects and stores them in a database view. The database view can then be queried programmatically via an Aspect or by using DQL to identify content similar to a selected object.

Many systems that identify similar content do so by storing a collection of fingerprints (sometimes called a sketch) for each document in a database with other fingerprints. When similar content is requested, these systems apply various algorithms to match the selected content’s fingerprints with those stored in the database. Full text indexing solutions also require databases and index files to store word tokens, stems, synonyms, locations, etc. to facilitate identification of similar content. Some full text search engines can be configured to select the most important words from a document, and build a query using those words to identify similar content in its indexes.

The solution I discuss in the article condenses the salient features of a document into a single, 64-bit hash value that can be attached directly to the content object as metadata, thus eliminating the need for additional databases, indexes, or advanced detection algorithms. Similar content can be detected by simply comparing hash values.
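To make the idea concrete, here is a minimal, self-contained sketch of SimHash (my own illustration, not the SIAspect code from the article): each token votes its hash bits up or down across a 64-bit weight vector, and the sign of each accumulated weight becomes one bit of the final hash. The whitespace tokenizer and the FNV-1a token hash are my assumptions; the article’s implementation may tokenize and hash differently.

```java
public class SimHashSketch {

    // Compute a 64-bit SimHash over whitespace-delimited tokens.
    static long simHash(String text) {
        int[] weights = new int[64];
        for (String token : text.toLowerCase().split("\\s+")) {
            if (token.isEmpty()) continue;
            long h = fnv1a64(token);
            // each set bit in the token's hash votes +1, each clear bit votes -1
            for (int bit = 0; bit < 64; bit++) {
                if (((h >>> bit) & 1L) == 1L) weights[bit]++; else weights[bit]--;
            }
        }
        // the sign of each weight becomes one bit of the final hash
        long result = 0L;
        for (int bit = 0; bit < 64; bit++) {
            if (weights[bit] > 0) result |= (1L << bit);
        }
        return result;
    }

    // 64-bit FNV-1a hash of a token (a stand-in for any stable token hash)
    static long fnv1a64(String s) {
        long h = 0xcbf29ce484222325L;
        for (byte b : s.getBytes(java.nio.charset.StandardCharsets.UTF_8)) {
            h ^= (b & 0xffL);
            h *= 0x100000001b3L;
        }
        return h;
    }

    public static void main(String[] args) {
        long a = simHash("the quick brown fox jumps over the lazy dog");
        long b = simHash("the quick brown fox jumped over the lazy dog");
        long c = simHash("completely unrelated words about database indexing");
        System.out.println("a vs b Hamming distance: " + Long.bitCount(a ^ b));
        System.out.println("a vs c Hamming distance: " + Long.bitCount(a ^ c));
    }
}
```

Because similar documents share most of their tokens, their bit-vote totals — and therefore their SimHash values — tend to differ in only a few bit positions, which is what makes a simple hash comparison meaningful.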

All of the articles selected for publication have been collected into a book of abstracts.  The 2014 book of abstracts can be accessed here (login may be required); mine is on page 41.  My article should be available for download in September 2014.  I will let you know when it is available.


 

Similarity Index Experiment, v2.0

In July 2011, I posted a paper that detailed my search for a process that could reduce the essence of a document to a number (the Similarity Index, or SI) so that it could be compared to other such numbers to determine similarity among documents. I was interested in finding a process that produced a number that could be queried using a percent differential (e.g., 90%) to find “similar” documents in a collection. The goal of the research was to affix the SI value to an object in a content repository as metadata. Then, whenever an object was selected, the repository could easily identify similar objects in the repository by querying for SI values that were within a range of the selected object’s SI. The capability to identify similar content in this manner could augment a content management system’s ability to search, mine, suggest, classify, and de-duplicate content, among other things.

In that first paper I determined that this idea was conceptually viable, though the methodology I developed suffered from some limitations and flaws.  Addressing those flaws has led to this second experiment, SIv2.0.  In SIv2.0 I use a new algorithm (SimHash) and technique (Hamming Distance) to more accurately identify similarity among documents, and show that the process easily scales by using a much larger test corpus.  I found this research to be fun and fascinating; I hope you enjoy the paper.
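To illustrate how a Hamming Distance translates into the percent differential mentioned above, here is a small sketch (my own illustration, assuming the 64-bit SimHash values described in the paper): the number of differing bits between two hashes, out of 64, is mapped onto a 0–100% similarity scale.

```java
public class HammingSimilarity {

    // Percent similarity between two 64-bit SimHash values:
    // 100% means identical hashes, 0% means all 64 bits differ.
    static double similarity(long siA, long siB) {
        int distance = Long.bitCount(siA ^ siB); // number of differing bits, 0..64
        return 100.0 * (64 - distance) / 64;
    }

    public static void main(String[] args) {
        long a = 0x0123456789abcdefL;
        long b = a ^ 0b111L; // flip 3 of the 64 bits
        System.out.printf("similarity: %.1f%%%n", similarity(a, b)); // prints: similarity: 95.3%
    }
}
```

With this mapping, a query for documents “at least 90% similar” becomes a query for stored SI values whose Hamming distance from the selected object’s SI is at most 6 bits.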
