The Similarity Index

Recently, an intriguing question was raised on the EDN: during the checkin of a document, is there a quick and easy way to determine whether a similar document already exists in the repository? This notion of identifying similar documents sparked an idea.

Was it possible to calculate a hash value for a document that captured its salient characteristics, such that a repository could be queried for like values to retrieve all “similar” documents? If so, similar documents could be identified with a simple DQL query, without the need for a full-text search engine. Such a value would allow systems to quickly identify duplicate or similar content before it is checked into a repository, added to an index, or returned in a query result. It could also help surface other content a user might be interested in, even though they did not explicitly query for it. And since the hash value is a single number or string, it can easily be attached to an object as metadata.
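To make the idea concrete, here is a minimal sketch of one well-known way to build such a value: a SimHash-style fingerprint. To be clear, this is not necessarily the method behind the Similarity Index; it is a swapped-in technique used only for illustration. The function names (`simhash`, `hamming_distance`), the 64-bit width, and the simple word tokenizer are all my assumptions.

```python
import hashlib
import re

BITS = 64  # assumed fingerprint width; small enough to store as one number or hex string


def simhash(text: str) -> int:
    """Return a 64-bit SimHash-style fingerprint of `text` (illustrative only)."""
    weights = [0] * BITS
    # Assumed tokenizer: lowercase words; a real system might use shingles or weighted terms.
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")  # first 64 bits of the token's hash
        # Each token votes +1 or -1 on every bit position.
        for i in range(BITS):
            weights[i] += 1 if (token_hash >> i) & 1 else -1
    # A bit is set in the fingerprint when the positive votes win.
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a: int, b: int) -> int:
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    doc_a = "The quarterly report covers revenue, costs, and staffing for the region."
    doc_b = "The quarterly report covers revenue, costs, and hiring for the region."
    print(format(simhash(doc_a), "016x"), format(simhash(doc_b), "016x"))
    print("differing bits:", hamming_distance(simhash(doc_a), simhash(doc_b)))
```

With a scheme like this, documents that share most of their words tend to produce fingerprints that differ in only a few bits. Note that an exact-match query would only find identical fingerprints; finding near matches means comparing fingerprints by distance, so a value meant to be queried for "like values" directly would need to be designed with that in mind.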

Over the past few months I have been doing some research and experimentation on this topic. I discovered that it is possible to calculate a hash value, the Similarity Index, as described above. You can find my research paper on the Research tab above or on the EDN.

About Scott
I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.
