In July 2011, I posted a paper that detailed my search for a process that could reduce the essence of a document to a number (the Similarity Index, or SI) so that it could be compared to other such numbers to determine similarity among documents. I was interested in finding a process that produced a number that could be queried using a percent differential (e.g., 90%) to find “similar” documents in a collection. The goal of the research was to affix the SI value to an object in a content repository as metadata. Then, whenever an object was selected, the repository could easily identify similar objects by querying for SI values that were within a range of the selected object’s SI. The capability to identify similar content in this manner could augment a content management system’s ability to search, mine, suggest, classify, and de-duplicate content, among other things.

In that first paper I determined that this idea was conceptually viable, though the methodology I developed suffered from some limitations and flaws. Addressing those flaws has led to this second experiment, SIv2.0. In SIv2.0 I use a new algorithm (SimHash) and technique (Hamming Distance) to more accurately identify similarity among documents, and I use a much larger test corpus to show that the process scales easily. I found this research to be fun and fascinating, and I hope you enjoy the paper.
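For readers who have not met SimHash or Hamming Distance before, here is a minimal Python sketch of the general idea. It is not the code from the paper: the word-level tokenization, the use of MD5 as the per-token hash, the 64-bit fingerprint size, and the function names `simhash` and `hamming_distance` are all my assumptions for illustration.

```python
import hashlib
import re


def simhash(text: str, bits: int = 64) -> int:
    """Return a SimHash-style fingerprint of `text` as an integer.

    Assumptions for this sketch: tokens are lowercase word runs, each
    token is hashed with MD5, and `bits` is a multiple of 8 no larger
    than 128 (the size of an MD5 digest).
    """
    weights = [0] * bits
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[: bits // 8], "big")
        # Each token votes +1 or -1 on every bit position.
        for i in range(bits):
            weights[i] += 1 if (token_hash >> i) & 1 else -1

    # A bit is set in the fingerprint when its votes sum above zero.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "The quick brown fox jumps over the lazy dog"
    doc_b = "The quick brown fox jumped over the lazy dog"
    print(hamming_distance(simhash(doc_a), simhash(doc_b)))
```

The smaller the Hamming Distance between two fingerprints, the more tokens the documents are likely to share, so a repository could store the fingerprint as metadata and treat objects within some distance threshold as "similar." The threshold itself would need to be tuned against a real corpus.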


I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.


