2014 Knowledge Sharing Article Published

As I mentioned in June, my 2014 EMC Proven Professionals Knowledge Sharing article, Find Similar Documents Without Using A Full Text Index, has been published on the ECN.  Here is the direct link to the PDF.  Please read it, download the code, give it a try, and let me know what you think.

UPDATE:  With the kind consent of EMC, my 2014 Knowledge Sharing article is now available on my Publications page (or here directly) for those of you without an ECN account.

My cert for being a published author!

[Image: 2014 Knowledge Sharing published author certificate]


My 2014 Knowledge Sharing Abstract

For the past 8 years, EMC has held an annual Knowledge Sharing Competition among its Proven Professionals.  This past year, I entered the competition with my article, Finding Similar Documents In Documentum Without Using a Full Text Index.  Though I didn’t win, my article was chosen for publication.

Here is the article’s abstract:

This Knowledge Sharing article will discuss how to configure Documentum to enable identification of syntactically similar content without the use of a full text indexing engine. The technique described utilizes a Java Aspect to calculate SimHash values for content objects and stores them in a database view. The database view can then be queried programmatically via an Aspect or by using DQL to identify content similar to a selected object.

Many systems that identify similar content do so by storing a collection of fingerprints (sometimes called a sketch) for each document in a database with other fingerprints. When similar content is requested, these systems apply various algorithms to match the selected content’s fingerprints with those stored in the database. Full text indexing solutions also require databases and index files to store word tokens, stems, synonyms, locations, etc. to facilitate identification of similar content. Some full text search engines can be configured to select the most important words from a document, and build a query using those words to identify similar content in its indexes.

The solution I discuss in the article condenses the salient features of a document into a single, 64-bit hash value that can be attached directly to the content object as metadata, thus eliminating the need for additional databases, indexes, or advanced detection algorithms. Similar content can be detected by simply comparing hash values.
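To make the idea of condensing a document into one 64-bit value concrete, here is a minimal Python sketch of a SimHash-style fingerprint. It is not the code from the article (which uses a Java Aspect inside Documentum); the function name `simhash64`, the whitespace-and-punctuation tokenizer, the equal weighting of every word, and the use of SHA-256 as the per-token hash are all assumptions made for illustration.

```python
import hashlib
import re


def simhash64(text: str) -> int:
    """Return a 64-bit SimHash-style fingerprint of text (illustrative sketch)."""
    # Assumption: lowercase word tokens, each counted with weight 1.
    tokens = re.findall(r"\w+", text.lower())
    weights = [0] * 64
    for token in tokens:
        # Assumption: the first 8 bytes of SHA-256 serve as the 64-bit token hash.
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        for bit in range(64):
            # Each token votes +1 for bits set in its hash and -1 for bits not set.
            if (token_hash >> bit) & 1:
                weights[bit] += 1
            else:
                weights[bit] -= 1
    # A bit of the fingerprint is set when the votes for it are net positive.
    fingerprint = 0
    for bit in range(64):
        if weights[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


if __name__ == "__main__":
    print(hex(simhash64("Find similar documents without a full text index")))
```

Because every token votes on every bit, two documents that share most of their words tend to produce fingerprints that differ in only a few bit positions, which is what makes a plain comparison of the stored values useful.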

All of the articles selected for publication have been collected into a book of abstracts.  The 2014 book of abstracts can be accessed here (login may be required); mine is on page 41.  My article should be available for download in September 2014.  I will let you know when it is available.


Similarity Index Experiment, v2.0

In July 2011, I posted a paper that detailed my search for a process that could reduce the essence of a document to a number (the Similarity Index, or SI) so that it could be compared to other such numbers to determine similarity among documents. I was interested in finding a process that produced a number that could be queried using a percent differential (e.g., 90%) to find “similar” documents in a collection. The goal of the research was to affix the SI value to an object in a content repository as metadata. Then, whenever an object was selected, the repository could easily identify similar objects in the repository by querying for SI values that were within a range of the selected object’s SI. The capability to identify similar content in this manner could augment a content management system’s ability to search, mine, suggest, classify, and de-duplicate content, among other things.

In that first paper I determined that this idea was conceptually viable, though the methodology I developed suffered from some limitations and flaws.  Addressing those flaws has led to this second experiment, SIv2.0.  In SIv2.0 I use a new algorithm (SimHash) and technique (Hamming Distance) to more accurately identify similarity among documents, and show that the process easily scales by using a much larger test corpus.  I found this research to be fun and fascinating, and I hope you enjoy the paper.
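For readers unfamiliar with Hamming distance, the short Python sketch below shows how two 64-bit fingerprints can be compared by counting the bits in which they differ. It is an illustration only, not the code from the paper: the helper names `hamming_distance` and `find_similar`, the sample fingerprint values, and the threshold of 3 bits are all hypothetical.

```python
def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions at which two fingerprints differ."""
    return bin(a ^ b).count("1")


def find_similar(target: int, candidates: dict, max_distance: int = 3) -> list:
    """Return names of candidates within max_distance bits of target, closest first."""
    hits = [(hamming_distance(target, value), name) for name, value in candidates.items()]
    return [name for distance, name in sorted(hits) if distance <= max_distance]


# Made-up fingerprints, for illustration only.
selected = 0xF0F0F0F0F0F0F0F0
repository = {
    "draft_v2": 0xF0F0F0F0F0F0F0F1,   # differs from selected in 1 bit
    "draft_v3": 0xF0F0F0F0F0F0F0FF,   # differs from selected in 4 bits
    "unrelated": 0x0F0F0F0F0F0F0F0F,  # differs from selected in all 64 bits
}

if __name__ == "__main__":
    print(find_similar(selected, repository))
```

The threshold is the tuning knob: a smaller value returns only near-duplicates, while a larger value admits looser matches at the cost of more false positives.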
