xPloring Proximity Searching in DFC 6.7SP2

Last year I posted a query to the ECN about wildcard and proximity searches using xPlore.  In part, I was looking for syntax that would let users enter two words in Webtop’s Quick Search box and specify that they must occur within a certain number of words of each other in the text.  For example, users might be interested in any document that contains the word “Walmart” within 10 words of “lawsuit”.  This capability existed long ago in Documentum when the full text search engine was Verity (ca. 4i and 5).

With xPlore v1.3 and DFC v6.7 SP2, EMC Documentum quietly added a proximity search capability to the IDfExpressionSet class used by the IDfQueryBuilder class (see Issue DFC-10955).  Queries can now be constructed to locate documents that contain words near each other.  An overview of this capability can be found here.  This capability is not present in DFC v7.0.  At first that seems a little odd, but remember that DFC v6.7 SP2 was published after DFC v7.0.  I assume/hope the capability will be rolled into DFC v7.1 later this year.

Note that it is only possible to implement proximity searching through the use of the IDfSearchService, the IDfQueryBuilder, and the IDfExpressionSet classes; it is not possible to use DQL for this purpose.  A good overview of these classes and how to construct queries using them can be found here.

To experiment with this new capability, I created a set of 5 documents that contained various arrangements of the words ‘Walmart’ and ‘Lawsuit’ (their names identify their specific configuration of the words).  I then wrote a simple test harness to execute various queries for these documents.  The following code snippet represents the guts of the experiment.


// build attr based search
qb1.setDatabaseSearchRequested(true);
qb1.getRootExpressionSet().addSimpleAttrExpression(OBJNAME_ATTR, IDfValue.DF_STRING,
        IDfSimpleAttrExpression.SEARCH_OP_CONTAINS, false, false, "Walmart");

// build simple FT search
qb2.setDatabaseSearchRequested(false);
qb2.getRootExpressionSet().addFullTextExpression("Walmart");

// build simple boolean FT search
qb3.setDatabaseSearchRequested(false);
qb3.getRootExpressionSet().addFullTextExpression("Walmart AND lawsuit");

// build FT search with proximity and position
qb4.setDatabaseSearchRequested(false);
qb4.getRootExpressionSet().addFullTextExpression("(Walmart AND lawsuit) NEAR(5)");
PositionalOperator po = new PositionalOperator();
po.setCount(5);
qb4.getRootExpressionSet().setDistance(po);

// build FT search with proximity without position
qb5.setDatabaseSearchRequested(false);
qb5.getRootExpressionSet().addFullTextExpression("(Walmart AND lawsuit) NEAR(5)");

// build FT search with position only
qb6.setDatabaseSearchRequested(false);
qb6.getRootExpressionSet().addFullTextExpression("Walmart AND lawsuit");
qb6.getRootExpressionSet().setDistance(po);

The code produced the following results.

RESULT #1
================================================
Attribute only query
query expression type is=(3) simpleattr – ‘Walmart’
query=SELECT r_object_id,object_name FROM dm_document WHERE (UPPER(object_name) LIKE '%WALMART%' ESCAPE '\') AND (a_is_hidden = FALSE) ENABLE(NOFTDQL)
result set size: 5
* result: Walmart attr test (090f0ff3800069bf) score: 0.7474509477615356
* result: Walmart FT test (090f0ff3800069c0) score: 0.7474509477615356
* result: Walmart and LS FT test (090f0ff3800069c1) score: 0.7394803762435913
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 0.7333883047103882
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 0.7286628484725952

RESULT #2
================================================
FT single word
query expression type is=(2) fulltext – ‘Walmart’
result set size: 5
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 1.0
* result: Walmart FT test (090f0ff3800069c0) score: 1.0
* result: Walmart attr test (090f0ff3800069bf) score: 1.0
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 1.0

RESULT #3
================================================
FT boolean
query expression type is=(2) fulltext – ‘Walmart AND lawsuit’
result set size: 3
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 1.0
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 1.0

RESULT #4
================================================
FT near with position
query expression type is=(2) fulltext – ‘(Walmart AND lawsuit) NEAR(5)’
result set size: 2
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 0.9962001442909241

RESULT #5
================================================
FT near without position
query expression type is=(2) fulltext – ‘(Walmart AND lawsuit) NEAR(5)’
result set size: 2
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 0.9962001442909241

RESULT #6
================================================
FT position without near
query expression type is=(2) fulltext – ‘Walmart AND lawsuit’
result set size: 3
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 1.0
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 1.0
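
Queries #4 and #5 both pass the same expression string to addFullTextExpression().  If the two terms and the distance come from user input, that string can be assembled with a small helper.  The following is only a sketch: the class and method names are hypothetical (they are not part of the DFC), and it covers only the two-term (term1 AND term2) NEAR(n) form used above, which is the only form I tested.

```java
class ProximityExpression {

    // Builds "(term1 AND term2) NEAR(n)", the form used in queries #4 and #5.
    // Hypothetical helper for illustration; it does not call the DFC.
    static String near(String term1, String term2, int distance) {
        if (term1 == null || term1.trim().isEmpty()
                || term2 == null || term2.trim().isEmpty()) {
            throw new IllegalArgumentException("both terms are required");
        }
        if (distance < 1) {
            throw new IllegalArgumentException("distance must be at least 1");
        }
        return "(" + term1.trim() + " AND " + term2.trim() + ") NEAR(" + distance + ")";
    }

    public static void main(String[] args) {
        // The resulting string is what would be handed to addFullTextExpression().
        System.out.println(near("Walmart", "lawsuit", 5));
    }
}
```

The returned string would then be passed to getRootExpressionSet().addFullTextExpression(), exactly as in the qb4 and qb5 cases.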

Observations and caveats:

  • The NEAR(x) syntax is only available in Documentum v6.7 SP2.  Contrary to the claim in the article above, I cannot get this to work with DFC v7.  DFC versions other than v6.7 SP2 do not contain the PositionalOperator class and cannot properly parse the NEAR(x) syntax.
  • XQuery must be the query language generated by the DFC.  This is the default unless you have added dfc.search.xquery.generation.enable=false to your dfc.properties file.
  • Stop words are counted when the distance between words is calculated.  Results #4 and #5 show this: the document in which ‘Walmart’ and ‘lawsuit’ are separated by 10 stop words was not returned.  Had stop words been excluded from the distance calculation, ‘Walmart’ and ‘lawsuit’ would have been treated as adjacent and the document would have matched NEAR(5).  My implementation of xPlore is out-of-the-box with no tweaks, tunes, filters, or configuration for stop words.
  • My experiment suggests that to do a proximity search using the NEAR(x) syntax, you do not have to use the PositionalOperator class (see results #4 and #5).  I’m not totally sure what to make of that, but to be on the safe side, use both.
  • You can use the NEAR(x) syntax in the Webtop v6.7 SP2 Quick Search box too.  For example, entering ‘(Walmart AND lawsuit) NEAR(5)’ returns the same set of objects as results #4 and #5.
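
The stop word observation is easier to see with a toy model of the distance calculation.  The sketch below is not xPlore's algorithm; it simply counts the words between two terms, with and without stop words.  The stop word list, the class and method names, and the reading of NEAR(5) as "at most 5 words in between" are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class ProximityDemo {

    // Hypothetical stop word list; xPlore's actual list may differ.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
            "a", "an", "and", "the", "of", "to", "in", "is", "it", "on"));

    // Returns the smallest number of words between the two terms,
    // or -1 if the terms do not both occur at different positions.
    static int wordsBetween(String text, String term1, String term2, boolean skipStopWords) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (t.isEmpty()) continue;
            if (skipStopWords && STOP_WORDS.contains(t)) continue;
            tokens.add(t);
        }
        String a = term1.toLowerCase(Locale.ROOT);
        String b = term2.toLowerCase(Locale.ROOT);
        int lastA = -1, lastB = -1, best = -1;
        for (int i = 0; i < tokens.size(); i++) {
            if (tokens.get(i).equals(a)) lastA = i;
            if (tokens.get(i).equals(b)) lastB = i;
            if (lastA >= 0 && lastB >= 0 && lastA != lastB) {
                int between = Math.abs(lastA - lastB) - 1;
                if (best < 0 || between < best) best = between;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String doc = "Walmart a an and the of to in is it on lawsuit";
        // Counting stop words: 10 words in between, so a limit of 5 is exceeded.
        System.out.println(wordsBetween(doc, "Walmart", "lawsuit", false));
        // Skipping stop words: the two terms become adjacent.
        System.out.println(wordsBetween(doc, "Walmart", "lawsuit", true));
    }
}
```

Under these assumptions, the "10 stop words" document fails a limit of 5 only when stop words are counted, which is consistent with what results #4 and #5 show.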

I am delighted to see proximity searching making its way back into the capabilities of the Content Server.  It would be great to recapture the capability to search for words in the same sentence, or in the same paragraph, as we were able to do in the days of the Verity search engine.  It would also be great to see this capability become native to DQL; using the IDfSearchService is certainly not my favorite way to construct queries.

Full source code for my experiment can be found here.

UPDATE:  See additional comments on this ECN thread.

About Scott
I have been implementing Documentum solutions since 1997. In 2005, I published a book about developing Documentum solutions for the Documentum Desktop Client (ISBN 0595339689). In 2010, I began this blog as a record of interesting and (hopefully) helpful bits of information related to Documentum, and as a creative outlet.
