My 2014 Knowledge Sharing Abstract

For the past 8 years, EMC has held an annual Knowledge Sharing Competition among its Proven Professionals. This past year, I entered the competition with my article, Finding Similar Documents In Documentum Without Using a Full Text Index. Though I didn’t win, my article was chosen for publication.

Here is the article’s abstract:

This Knowledge Sharing article will discuss how to configure Documentum to enable identification of syntactically similar content without the use of a full text indexing engine. The technique described utilizes a Java Aspect to calculate SimHash values for content objects and stores them in a database view. The database view can then be queried programmatically via an Aspect or by using DQL to identify content similar to a selected object.

Many systems that identify similar content do so by storing a collection of fingerprints (sometimes called a sketch) for each document in a database with other fingerprints. When similar content is requested, these systems apply various algorithms to match the selected content’s fingerprints with those stored in the database. Full text indexing solutions also require databases and index files to store word tokens, stems, synonyms, locations, etc. to facilitate identification of similar content. Some full text search engines can be configured to select the most important words from a document, and build a query using those words to identify similar content in its indexes.
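To make the fingerprint-collection approach concrete, here is a minimal Python sketch of one common variant: hashing overlapping word shingles and comparing the resulting sets by Jaccard overlap. It is an illustration only, not code from the article; the function names, the shingle size of 3, and the use of MD5 as the hash are all assumptions.

```python
import hashlib


def fingerprints(text, k=3):
    """Return a set of 64-bit hashes, one per k-word shingle (hypothetical helper)."""
    words = text.lower().split()
    # Slide a window of k words over the text; short texts yield one shingle.
    shingles = [" ".join(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))]
    return {
        int.from_bytes(hashlib.md5(s.encode("utf-8")).digest()[:8], "big")
        for s in shingles
    }


def jaccard(a, b):
    """Share of fingerprints the two sets have in common (0.0 to 1.0)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


doc_a = fingerprints("the quick brown fox jumps over the lazy dog")
doc_b = fingerprints("the quick brown fox leaps over the lazy dog")
print(jaccard(doc_a, doc_b))
```

Note that every document carries a whole set of values here, which is why systems built this way need a separate store for the fingerprints.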

The solution I discuss in the article condenses the salient features of a document into a single, 64-bit hash value that can be attached directly to the content object as metadata, thus eliminating the need for additional databases, indexes, or advanced detection algorithms. Similar content can be detected by simply comparing hash values.
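For readers unfamiliar with SimHash, the following is a minimal Python sketch of the general idea of a single 64-bit similarity hash. It is not the Java Aspect described in the article: the tokenization (lowercased word tokens), the MD5-based feature hash, and the names `simhash64` and `hamming_distance` are assumptions made for illustration.

```python
import hashlib
import re


def simhash64(text):
    """Condense a text into one 64-bit value (generic SimHash, assumed details)."""
    counts = [0] * 64
    for token in re.findall(r"\w+", text.lower()):
        # Use the first 8 bytes of an MD5 digest as a 64-bit feature hash.
        h = int.from_bytes(hashlib.md5(token.encode("utf-8")).digest()[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for bit in range(64):
            counts[bit] += 1 if (h >> bit) & 1 else -1
    value = 0
    for bit in range(64):
        if counts[bit] > 0:
            value |= 1 << bit
    return value


def hamming_distance(a, b):
    """Number of bit positions in which two hash values differ."""
    return bin(a ^ b).count("1")


h1 = simhash64("Documentum stores content objects with metadata")
h2 = simhash64("Documentum stores content objects with attributes")
print(hamming_distance(h1, h2))
```

In a sketch like this, two documents would be treated as similar when the Hamming distance between their hashes falls under some threshold; the threshold is a tuning choice and is not taken from the article.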

All of the articles selected for publication have been collected into a book of abstracts.  The 2014 book of abstracts can be accessed here (login may be required); mine is on page 41.  My article should be available for download in September 2014.  I will let you know when it is available.


About Scott
I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.

3 Responses to My 2014 Knowledge Sharing Abstract

  1. Pingback: 2014 Knowledge Sharing Article Published | dm_misc: Miscellaneous Documentum Information

  2. Charles DeRosa says:

    Scott,

    Excellent work! The concept appears very similar to plagiarism detection (http://en.wikipedia.org/wiki/Plagiarism_detection). It might also relate to householding algorithms, which determine whether two person records are similar but not exactly the same (http://analytics.ncsu.edu/sesug/1999/085.pdf).

    You might want to consider a second career in academia as a professor in this area. Very interesting work.

    Best regards,
    Charles DeRosa


    • Scott says:

      Thanks for the kind words, Charles. I read about numerous plagiarism detection systems while doing the research for this article. The major difference is that those systems store a document’s fingerprints in a database, whereas I wanted to consolidate that information into a single value, the Similarity Index, so documents could be compared independently of a database.

