Testing Thesauri in xPlore

To follow up on my last post about using custom thesauri with xPlore, I thought I would share a little about how to debug thesauri in xPlore and watch term expansion in progress.

  1. Log in to the xPlore Admin Console.
  2. Expand the node:  Home >> Services >> Logging
  3. Click the ‘Configuration’ button in the upper right-hand corner of the Logging screen.
  4. Change the configuration setting for ‘dsearch-search’ to ‘DEBUG’, and click ‘OK’
  5. Ensure a thesaurus is loaded in xPlore by expanding the node: Home >> Diagnostics and Utilities >> Thesaurus.  You should see at least one thesaurus listed in the ‘Thesaurus List’ pane.  If not, please see the xPlore v1.3 Administration and Development Guide, pages 213-217.
  6. In Webtop, perform a full text search using one of the terms in your thesaurus.  To keep with the theme of my previous post, I searched for ‘tylenol’.
  7. Returning to the xPlore Admin Console, expand the node:  Home >> Instances >> primaryDsearch >> Logging, and click the ‘dsearch’ tab.
  8. Peruse the log file for the following key log entries to see the thesaurus at work:
    1. c.e.d.c.f.common.search.impl.SKOSThesaurusHandler – executing the thesaurus lookup query to get related terms for [tylenol]
    2. c.e.d.c.f.i.services.thesaurus.QueryThesaurus – getTermsFromThesaurus returned related terms [acetaminophen, ibuprofin, Motrin]
    3. 2013-10-05 20:38:09,739 DEBUG [pool-13-thread-10] c.e.d.c.f.indexserver.cps.CPSTokenStreamInBinary – Returned token: tylenol
    4. 2013-10-05 20:38:09,752 DEBUG [pool-13-thread-10] c.e.d.c.f.indexserver.cps.CPSTokenStreamInBinary – Returned token: acetaminophen
    5. 2013-10-05 20:38:09,752 DEBUG [pool-13-thread-10] c.e.d.c.f.indexserver.cps.CPSTokenStreamInBinary – Returned token: ibuprofin
    6. 2013-10-05 20:38:09,753 DEBUG [pool-13-thread-10] c.e.d.c.f.indexserver.cps.CPSTokenStreamInBinary – Returned token: motrin
    7. c.e.d.c.f.i.s.s.impl.AbstractSummaryProcessor – Generated Lucene query for summary: text:tylenol text:acetaminophen text:ibuprofin text:motrin
    8. 2013-10-05 20:38:09,780 DEBUG [pool-13-thread-10] c.e.d.c.f.i.services.summary.impl.SummaryProcessor – formattedText = This is a file about Ibuprofin. Ibuprofin is an analgesic.
    9. 2013-10-05 20:38:09,790 DEBUG [pool-13-thread-10] c.e.d.c.f.i.services.summary.impl.SummaryProcessor – formattedText = This file contains Tylenol.
    10. 2013-10-05 20:38:09,799 DEBUG [pool-13-thread-10] c.e.d.c.f.i.services.summary.impl.SummaryProcessor – formattedText = This file contains acetaminophen

What all that means:

  1. This is xPlore executing the thesaurus lookup for ‘tylenol’ to see if it has any entries (<prefLabel>) in the thesaurus
  2. This is the returned list of <altLabels> the thesaurus found for the ‘tylenol’ entry
  3. The thesaurus returning a search token for ‘tylenol’
  4. The thesaurus returning a search token for ‘acetaminophen’
  5. The thesaurus returning a search token for ‘ibuprofin’
  6. The thesaurus returning a search token for ‘motrin’
  7. xPlore is generating the search query using the thesaurus terms
  8. This is a snippet from one of the found documents which contains one of the search terms
  9. This is a snippet from one of the found documents which contains one of the search terms
  10. This is a snippet from one of the found documents which contains one of the search terms
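
If the log is busy, it can help to pull out only the entries discussed above.  The following is a minimal sketch, assuming you have saved a copy of the log to a local file; the class name ThesaurusLogFilter and the file path argument are my own illustration and not part of xPlore.  The logger-name fragments it matches are taken from the entries quoted above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of xPlore): keeps only the log lines
// that mention the loggers seen during thesaurus expansion.
public class ThesaurusLogFilter {

    // Logger-name fragments copied from the log entries quoted above.
    private static final List<String> MARKERS = Arrays.asList(
            "SKOSThesaurusHandler",
            "QueryThesaurus",
            "CPSTokenStreamInBinary",
            "SummaryProcessor"); // also matches AbstractSummaryProcessor

    // True if the line mentions any of the loggers of interest.
    static boolean isThesaurusRelated(String line) {
        for (String marker : MARKERS) {
            if (line.contains(marker)) {
                return true;
            }
        }
        return false;
    }

    // Returns the matching lines in their original order.
    static List<String> filter(List<String> lines) {
        List<String> matches = new ArrayList<>();
        for (String line : lines) {
            if (isThesaurusRelated(line)) {
                matches.add(line);
            }
        }
        return matches;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: java ThesaurusLogFilter <path-to-log-copy>");
            return;
        }
        // Assumes the saved log copy is UTF-8 text.
        List<String> lines = Files.readAllLines(Paths.get(args[0]), StandardCharsets.UTF_8);
        for (String line : filter(lines)) {
            System.out.println(line);
        }
    }
}
```

Remember to set ‘dsearch-search’ back to its previous level when you are done, since DEBUG logging is verbose.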

UPDATE:  This post resulted in a derivative post here: http://www.armedia.com/blog/2014/03/expanding-documentums-full-text-search-capability-with-a-thesaurus/.

Using a Custom xPlore Thesaurus

In this post I examine using a custom thesaurus in xPlore to solve three common problems encountered when conducting full text searches:

  1. Accounting for frequently misspelled words.  For example, Walmart is often spelled:  Walmart (correct), Wal-mart, Wal_mart, or Wal Mart. Or, Webtop can be spelled Webtop (correct), WebTop, or Web Top.  Note some of these variations are reduced by doing case-insensitive searches.
  2. Accounting for product names that have changed, for example:  Site Caching Services (SCS) is now Interactive Delivery Services (IDS).
  3. Expanding the scope of a search by including synonyms and related concepts, such as: Tylenol, acetaminophen, ibuprofen.

My first task was to learn about xPlore thesauri and how to create them.  The xPlore v1.3 Administration and Development Guide, pages 213-217, do a good job of explaining what you need to know to create and install thesauri in xPlore.  xPlore thesauri use the SKOS model and are simple to create (I used TextPad) and install, though the lessons I learned below hint at the thought power required to really use thesauri effectively.

Misspelled Words

Here is my first thesaurus to address problem #1.  Initially I created one Concept block for ‘Walmart’ and included all the variations as altLabels.

[Image: Walmart RDF]

I learned that only the prefLabel values are searched; the altLabels are expansions of the prefLabel. When I included only the first Concept block for ‘Walmart’, searching on ‘Wal-mart’ did not expand the term to include the other variations. Therefore, to cover all the misspelling possibilities, each term was included as a Concept with the other spellings included as altLabels. Here is a screenshot to illustrate the results.

[Image: Walmart search]
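
Writing one Concept per spelling by hand gets tedious as the list of variants grows, so the entries can be generated.  The following is a rough sketch, not the thesaurus I actually loaded: the class name SkosVariantGenerator and the example.com URIs are placeholders of mine, and the element layout follows the general W3C SKOS RDF/XML vocabulary.  Check the output against the examples in the xPlore guide before loading it.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical generator: emits one SKOS Concept per spelling variant,
// with every other variant listed as an altLabel.
public class SkosVariantGenerator {

    // Placeholder base URI for rdf:about; substitute your own.
    private static final String BASE_URI = "http://example.com/thesaurus#concept";

    // Minimal XML escaping for element text.
    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Builds a single Concept block with 'pref' as prefLabel and
    // all other variants as altLabels.
    static String conceptFor(int index, String pref, List<String> variants) {
        StringBuilder sb = new StringBuilder();
        sb.append("  <skos:Concept rdf:about=\"").append(BASE_URI).append(index).append("\">\n");
        sb.append("    <skos:prefLabel>").append(escape(pref)).append("</skos:prefLabel>\n");
        for (String alt : variants) {
            if (!alt.equals(pref)) {
                sb.append("    <skos:altLabel>").append(escape(alt)).append("</skos:altLabel>\n");
            }
        }
        sb.append("  </skos:Concept>\n");
        return sb.toString();
    }

    // Builds the whole RDF document: one Concept per variant.
    static String thesaurus(List<String> variants) {
        StringBuilder sb = new StringBuilder();
        sb.append("<?xml version=\"1.0\" encoding=\"UTF-8\"?>\n");
        sb.append("<rdf:RDF xmlns:rdf=\"http://www.w3.org/1999/02/22-rdf-syntax-ns#\"\n");
        sb.append("         xmlns:skos=\"http://www.w3.org/2004/02/skos/core#\">\n");
        for (int i = 0; i < variants.size(); i++) {
            sb.append(conceptFor(i + 1, variants.get(i), variants));
        }
        sb.append("</rdf:RDF>\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        // The spelling variants mentioned earlier in this post.
        List<String> variants = Arrays.asList("Walmart", "Wal-mart", "Wal_mart", "Wal Mart");
        System.out.print(thesaurus(variants));
    }
}
```

The same generator covers the product rename case: pass it the list IDS, SCS and it emits a Concept for each with the other as the altLabel.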

Product Rename

My approach for solving the second problem (renamed products) was similar to the first. I created a Concept for ‘IDS’ and included ‘SCS’ as an expansion term, and to ensure the reverse was also true, I created a Concept for ‘SCS’ and included ‘IDS’ as an expansion term.  Therefore, anyone searching on one product would also find information pertaining to the other.  Here are the thesaurus entries for IDS and SCS.

[Image: IDS RDF]

And here is how that search worked out.

[Image: IDS search]

Related Concepts

Addressing problem #3 (related concepts) was a lot more interesting and delved into the realm of knowledge management and taxonomies.  To address this problem, I wanted to create the concept of ‘analgesics’ and include in it sub-concepts of ‘generic’ and ‘brand name’ types of medication.  For example, the analgesics concept would include the two generic medications, ‘acetaminophen’ and ‘ibuprofen’, and then two commercial brands of those drugs, ‘Tylenol’ and ‘Motrin’.  Though the SKOS model supports many knowledge management constructs like broader, narrower, and related, xPlore seems limited to just the prefLabel and altLabel tags.  With these limitations, my thesaurus entry for analgesics turned out looking much like the thesaurus entry for the product renaming problem in #2.  To thoroughly flesh out analgesics, Concept entries would also need to be made for each of the altLabels to provide term expansion on any of the altLabel terms.

[Image: Tylenol RDF]

Here is an example of the results obtained when searching for ‘tylenol’.

[Image: tylenol search results]

These use cases are fairly trivial, but I hope they whet your appetite for what can be done with a simple thesaurus in xPlore.  Other interesting use cases might be expanding search terms in one language to include synonyms from another language, or loading an industry-specific thesaurus.

UPDATE:  As is usually the case, after posting an article I stumble upon some more great info.  In this case, two ECN posts by Ed Bueche regarding cool things you can do with the xPlore thesaurus.

UPDATE:  This post also resulted in a derivative post here:  http://www.armedia.com/blog/2014/03/expanding-documentums-full-text-search-capability-with-a-thesaurus/

 

xPloring Proximity Searching in DFC 6.7SP2

Last year I posted a query to the ECN about wildcard and proximity searches using xPlore.  In part, I was searching for syntax that would allow users to enter two words in Webtop’s Quick Search box and specify that they should occur within a certain number of words of each other in the text.  For example, users might be interested in any document that contained the word “Walmart” within 10 words of “lawsuit”.  This capability existed long ago in Documentum when the full text search engine was Verity (ca. 4i and 5).

With xPlore v1.3 and DFC v6.7 SP2, EMC Documentum quietly added a proximity search capability to the IDfExpressionSet class used by IDfQueryBuilder (see Issue DFC-10955).  Queries can now be constructed to locate documents that contain words near each other.  An overview of this capability can be found here.  This capability is not present in DFC v7.0.  At first that seems a little odd, but remember DFC v6.7 SP2 was published after DFC v7.0.  I assume/hope the capability will be rolled into DFC v7.1 later this year.

Note that it is only possible to implement proximity searching through the use of the IDfSearchService, the IDfQueryBuilder, and the IDfExpressionSet classes; it is not possible to use DQL for this purpose.  A good overview of these classes and how to construct queries using them can be found here.

To experiment with this new capability, I created a set of 5 documents that contained various arrangements of the words ‘Walmart’ and ‘Lawsuit’ (their names identify their specific configuration of the words).  I then wrote a simple test harness to execute various queries for these documents.  The following code snippet represents the guts of the experiment.


// build attr based search
qb1.setDatabaseSearchRequested(true);
qb1.getRootExpressionSet().addSimpleAttrExpression(OBJNAME_ATTR, IDfValue.DF_STRING,
        IDfSimpleAttrExpression.SEARCH_OP_CONTAINS, false, false, "Walmart");

// build simple FT search
qb2.setDatabaseSearchRequested(false);
qb2.getRootExpressionSet().addFullTextExpression("Walmart");

// build simple boolean FT search
qb3.setDatabaseSearchRequested(false);
qb3.getRootExpressionSet().addFullTextExpression("Walmart AND lawsuit");

// build FT search with proximity and position
qb4.setDatabaseSearchRequested(false);
qb4.getRootExpressionSet().addFullTextExpression("(Walmart AND lawsuit) NEAR(5)");
PositionalOperator po = new PositionalOperator();
po.setCount(5);
qb4.getRootExpressionSet().setDistance(po);

// build FT search with proximity without position
qb5.setDatabaseSearchRequested(false);
qb5.getRootExpressionSet().addFullTextExpression("(Walmart AND lawsuit) NEAR(5)");

// build FT search with position only
qb6.setDatabaseSearchRequested(false);
qb6.getRootExpressionSet().addFullTextExpression("Walmart AND lawsuit");
qb6.getRootExpressionSet().setDistance(po);

The code produced the following results.

RESULT #1
================================================
Attribute only query
query expression type is=(3) simpleattr – ‘Walmart’
query=SELECT r_object_id,object_name FROM dm_document WHERE (UPPER(object_name) LIKE '%WALMART%' ESCAPE '\') AND (a_is_hidden = FALSE) ENABLE(NOFTDQL)
result set size: 5
* result: Walmart attr test (090f0ff3800069bf) score: 0.7474509477615356
* result: Walmart FT test (090f0ff3800069c0) score: 0.7474509477615356
* result: Walmart and LS FT test (090f0ff3800069c1) score: 0.7394803762435913
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 0.7333883047103882
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 0.7286628484725952

RESULT #2
================================================
FT single word
query expression type is=(2) fulltext – ‘Walmart’
result set size: 5
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 1.0
* result: Walmart FT test (090f0ff3800069c0) score: 1.0
* result: Walmart attr test (090f0ff3800069bf) score: 1.0
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 1.0

RESULT #3
================================================
FT boolean
query expression type is=(2) fulltext – ‘Walmart AND lawsuit’
result set size: 3
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 1.0
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 1.0

RESULT #4
================================================
FT near with position
query expression type is=(2) fulltext – ‘(Walmart AND lawsuit) NEAR(5)’
result set size: 2
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 0.9962001442909241

RESULT #5
================================================
FT near without position
query expression type is=(2) fulltext – ‘(Walmart AND lawsuit) NEAR(5)’
result set size: 2
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 0.9962001442909241

RESULT #6
================================================
FT position without near
query expression type is=(2) fulltext – ‘Walmart AND lawsuit’
result set size: 3
* result: Walmart and LS FT test with 5 non stop words (090f0ff3800069c3) score: 1.0
* result: Walmart and LS FT test (090f0ff3800069c1) score: 1.0
* result: Walmart and LS FT test with 10 stop words (090f0ff3800069c2) score: 1.0

Observations and caveats:

  • The NEAR(x) syntax is only available in Documentum v6.7 SP2.  Contrary to the claim in the article above, I cannot get this to work with DFC v7.  Any DFC other than v6.7 SP2 does not contain the PositionalOperator class and cannot properly parse the NEAR(x) syntax.
  • XQuery must be the query language generated by the DFC.  This is the default case unless you added dfc.search.xquery.generation.enable=false to your dfc.properties file.
  • Stop words are counted when distance is calculated between words.  This is evidenced by results #4 and #5, where the document including ‘Walmart’ and ‘lawsuit’ separated by 10 stop words would have been found if stop words were ignored, but was not.  If stop words were eliminated from the distance calculation, ‘Walmart’ and ‘lawsuit’ would have been adjacent.  My implementation of xPlore is out-of-the-box with no tweaks, tunes, filters, or configuration for stop words.
  • My experiment suggests that to do a proximity search using the NEAR(x) syntax, you do not have to use the PositionalOperator class (see results #4 and #5).  I’m not totally sure what to make of that, but to be on the safe side, use both.
  • You can use the NEAR(x) syntax in the Webtop v6.7SP2 Quick Search box too.  For example, entering ‘(Walmart AND lawsuit) NEAR(5)’ returns the same set of objects as results #4 and #5.
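
To make the stop word observation concrete, here is a toy model of a proximity check in which every token, stop word or not, counts toward the distance.  This is only a sketch under my own assumptions and not how xPlore is implemented: the class ProximityDemo is hypothetical, tokens are split on non-word characters, word order is ignored, and "NEAR(n)" is read as "at most n words between the two terms", which is consistent with the results above.

```java
import java.util.Locale;

// Toy illustration only: a naive proximity check where stop words
// are counted like any other token.
public class ProximityDemo {

    // True if 'a' and 'b' occur with at most 'maxBetween' tokens
    // between them, in either order. No stop word filtering is done.
    static boolean near(String text, String a, String b, int maxBetween) {
        String[] tokens = text.toLowerCase(Locale.ROOT).split("\\W+");
        String first = a.toLowerCase(Locale.ROOT);
        String second = b.toLowerCase(Locale.ROOT);
        int lastFirst = -1;  // most recent position of 'a'
        int lastSecond = -1; // most recent position of 'b'
        for (int i = 0; i < tokens.length; i++) {
            if (tokens[i].equals(first)) {
                lastFirst = i;
            }
            if (tokens[i].equals(second)) {
                lastSecond = i;
            }
            if (lastFirst >= 0 && lastSecond >= 0
                    && Math.abs(lastFirst - lastSecond) - 1 <= maxBetween) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Ten stop words sit between the two terms in this made-up sentence.
        String stopWords = "Walmart is the of and a to in that it for lawsuit";
        String adjacent = "The Walmart lawsuit was filed";
        System.out.println("adjacent, NEAR(5):      " + near(adjacent, "Walmart", "lawsuit", 5));
        System.out.println("10 stop words, NEAR(5): " + near(stopWords, "Walmart", "lawsuit", 5));
        System.out.println("10 stop words, NEAR(10): " + near(stopWords, "Walmart", "lawsuit", 10));
    }
}
```

Under this model the document with 10 stop words only matches once the distance is raised to 10, which mirrors what I saw in results #4 and #5.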

I am delighted to see proximity searching making its way back into the capabilities of the Content Server.  It would be great to recapture the capability to search for words in the same sentence, or in the same paragraph, as we were able to do in the days of the Verity search engine.  It would also be great to see this capability become native to DQL; using the IDfSearchService is certainly not my favorite way to construct queries.

Full source code for my experiment can be found here.

UPDATE:  See additional comments on this ECN thread.

HTA Monitor Update

I updated the HTA monitor code to include ADTS and xPlore processes.  I also zipped the source code to make it easier to download.  Here is the link.

See original post here.
