Selecting Random r_object_ids

This week I had an engineer ask me if there was a way to select a random document from a repository. He was trying to get some performance metrics for a web service by exporting random documents from the repository.

His first approach was to run a query that loaded all eligible r_object_ids into an array and then used Java’s Math.random() method to generate a random index into the array. Using the randomly selected r_object_id, he then exported the document from the repository. On small data sets this  approach worked well. However, it didn’t scale when he moved the performance test to a one million+ object repository; it took too long to build the array of r_object_ids. So, the question arose: is there a way to have the Content Server or the database choose a random r_object_id?

After a little research I was able to randomly select 100 r_object_ids from the database using the following query:

select top 100 d.r_object_id from dm_document_s d, dm_sysobject_s s where s.r_full_content_size > 0 and d.r_object_id = s.r_object_id and s.r_object_type = ‘dm_document’ order by newid();

This query will randomly select 100 dm_document r_object_ids that have content. The key to the query is the ‘order by newid()‘ predicate that implements the randomness. Note this query only works with Microsoft SQL Server. A drawback to this query is that it does not run in DQL; it must to be run directly on Documentum’s database tables from a JDBC connection.

So, with a little modification to the performance metric code, my engineer was able to gather the metrics he needed by randomly selecting objects in a quick and efficient way.

A similar query in Oracle looks something like this, although I cannot test it:

SELECT r_object_id FROM
( SELECT r_object_id FROM dm_document_s d, dm_sysobject_s s where s.r_full_content_size > 0 and d.r_object_id = s.r_object_id and s.r_object_type = ‘dm_document’ ORDER BY dbms_random.value )
WHERE rownum = < 101

Special thanks to Pete Freitag and his blog, www.petefreitag.com/item/466.cfm.

UPDATE:  I found this book excerpt this week that discusses ways to select random rows from tables:  http://media.pragprog.com/titles/bksqla/random.pdf.

Advertisements

About Scott
I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.

5 Responses to Selecting Random r_object_ids

  1. mgr2080 says:

    I normally use a DQL statement like this one:
    select * from dm_document where r_object_id like ‘%f’
    (or like ‘%ff’ or like ‘%fff’ etc.)

    This will return to you about 1 in every 16 (or 256 or 4096, etc.) random documents.
    Of course the returned documents will not be random in the sense that their object IDs all end in “f”, but they normally will be random from any functional perspective.

    Like

  2. Matt says:

    Rather than opening / closing a JDBC connection, why not just make a MS SQL Server View and register it as a registered table? This way you could use DQL to query the Registered Table / underlying view.

    -Matt

    Like

    • Scott says:

      That’s a good idea, too. That seems like best of both worlds. You still get to use the randomization of the SQL query and can retrieve the r_object_ids using DQL’s ENABLE(RETURN_TOP 100). Cool.

      Like

  3. doquent says:

    Here is a completely different idea. It picks one random document efficiently using DFC/DQL. If you really needed a set of random documents, and are OK with DB-specific queries then one of the approaches discussed earlier will be better suited.

    The original question was about picking 1 random document, and the randomness could be introduced using attributes such as creation or modification date on the document. In DFC, we could pick a random date in the past – can be done easily in Java. Then we could compose a query to pick a document with creation or modification date (pick one) after the random date – use ENABLE(RETURN_TOP 1). Alternatively, we could create a qualification and use a DFC method that returns just one result.

    Of course, additional desired constraints (content size, etc.) could also be used in conjunction with the randomness condition. As long as one of the conditions is random, we should be good.

    Like

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: