Chapter 4: Document Management

Search strategies

A concept search engine performs at its best when you enter queries with search words in context (for example, in short phrases rather than as isolated words). In addition, if you know that more than one language is in use, repeating the concepts using different vocabulary generally improves results. Searching is often an iterative activity with queries being expanded and refined based on the results returned.

This section provides tips for getting the best results from a concept-based search engine, which offers greater flexibility than traditional approaches to free-text searching, such as Boolean combinations of keywords.

For example, a user receives an e-mail message that says:

Following the incident close to Watford railway station in July, we need to assess the damage being done by tree branches tangling in overhead power lines or falling onto the tracks.

The user then wants to locate documents matching the e-mail message. Using a traditional search method, he or she might enter something similar to:

branches AND lines AND tracks

In this query, the user is using the Boolean operator AND to filter the information. This type of query is very precise and is helpful when:

The user knows exactly what information is required, and it can be expressed in just a few words.
There is no ambiguity in the words used in the query.
The vocabulary of the target documents is known precisely.

In practice, this is rarely the case. It is more common that users are unsure of how to formulate their query precisely, thus introducing ambiguity within the query. Differing vocabulary used in documents to describe similar concepts can also result in important documents being missed altogether, and too many irrelevant documents being returned.

If the user is searching a large database of documents, a query like the one above is likely to retrieve a large number of items, many of them irrelevant, because it searches for a small number of isolated words. Words like “branches” and “lines” are ambiguous and occur frequently in a database of documentation concerning the railway system.
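The weakness of the Boolean approach can be seen in a minimal sketch. The function names and sample documents below are hypothetical, invented for illustration only; they do not represent any product's implementation. The filter keeps only documents that contain every query word exactly as typed.

```python
import re

def tokenize(text):
    """Lowercase a string and return the set of alphabetic words it contains."""
    return set(re.findall(r"[a-z]+", text.lower()))

def boolean_and(query_terms, documents):
    """Return the documents that contain every query term as an exact word."""
    terms = {term.lower() for term in query_terms}
    return [doc for doc in documents if terms <= tokenize(doc)]

# Hypothetical sample documents (assumptions made for this illustration).
documents = [
    # Relevant to the e-mail message, but uses different vocabulary.
    "Storm damage: a fallen tree brought down the overhead power cables and blocked the rails.",
    # Irrelevant, but happens to contain all three query words.
    "New branches of the ticket office open in May; queue lines are marked beside the tracks.",
    # Irrelevant, and contains "branch" rather than "branches".
    "Timetable changes for branch lines and freight tracks.",
]

matches = boolean_and(["branches", "lines", "tracks"], documents)
for doc in matches:
    print(doc)
```

In this sketch the only match is the ticket office notice. The storm damage report, which is the document the user actually wants, is filtered out because it says “cables” and “rails” instead of “lines” and “tracks.”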

A probabilistic search engine like the OmniQ Enterprise Inference Engine is better suited to a query that contains a number of concepts and is expressed using ambiguous language, which increases the likelihood that the user retrieves results relevant to the query.

Using the e-mail example from above, isolate the key concepts, which are:

Damage being done by tree branches
Tangling of overhead power lines
Falling trees and tree branches
Obstruction or damage to tracks

Irrelevant concepts might include:

Watford Railway Station
July

Including irrelevant concepts makes the search less focused and may introduce unwanted documents. A more effective query is therefore:

damage being done by tree branches, tangling of overhead power lines, falling tree branches, obstruction and damage to tracks

You need not delimit concepts using a comma.

This is a better query because it contains all of the key concepts from the original e-mail message and expresses them using words in context. It is likely to produce significantly better results than the first attempt. The best documents are likely to achieve very high relevance scores, with scores falling off rapidly for documents that do not include all of these concepts.

However, some relevant documents are still likely to be missed because of differing vocabulary. You can therefore use your knowledge of the environment to expand the original concepts with variations that you know from experience tend to occur. This may produce a query similar to:

damage being done by tree branches, tangling of overhead power lines, falling tree branches, obstruction and damage to tracks, forestry, wind damage, storm damage, damage to rails, lines being pulled down by trees blown over

At first this may seem more confusing and less precise than the previous examples, but in fact it contains additional ways of defining the original concepts. You may find that no documents achieve a 100% relevance score with this query because no document includes all of these combinations. However, the most relevant documents are at the top of the list.
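The effect of the expanded query on ranking can be illustrated with a toy scoring function. This is a sketch under stated assumptions, not the way any probabilistic engine actually computes relevance: the score here is simply the fraction of the query's content words found in a document, and the stopword list, function names, and sample documents are all hypothetical.

```python
import re

# Small illustrative stopword list (an assumption, not taken from any product).
STOPWORDS = {"a", "an", "and", "being", "by", "done", "of", "on", "or", "over", "the", "to"}

def content_words(text):
    """Return the set of lowercase words in the text, minus stopwords."""
    return {w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS}

def relevance(query, document):
    """Toy score: fraction of the query's content words present in the document."""
    query_words = content_words(query)
    return len(query_words & content_words(document)) / len(query_words)

def rank(query, documents):
    """Return (score, document) pairs sorted from most to least relevant."""
    scored = [(relevance(query, doc), doc) for doc in documents]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)

expanded_query = (
    "damage being done by tree branches, tangling of overhead power lines, "
    "falling tree branches, obstruction and damage to tracks, forestry, "
    "wind damage, storm damage, damage to rails, "
    "lines being pulled down by trees blown over"
)

# Hypothetical sample documents (assumptions made for this illustration).
documents = [
    "Timetable changes for branch lines and freight tracks.",
    "Falling tree branches caused an obstruction on the tracks.",
    "Storm damage report: trees blown down by wind pulled overhead power lines onto the rails.",
]

ranked = rank(expanded_query, documents)
for score, doc in ranked:
    print(f"{score:.0%}  {doc}")
```

No document in this sketch reaches a score of 100%, because none contains every variation in the query. The ordering is still useful: the storm damage report ranks first, the falling branches report second, and the timetable notice last.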

Often, you can improve search results by feeding back information from documents discovered by the system. For example, if a search produces a document that is relevant but the terminology used in the extracted summary is different from the search text, you may want to expand the original query by appending words or phrases from the document “hit list.” In this way, the search becomes more accurate as you provide additional information.
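The feedback step can be sketched as a small helper that appends unfamiliar words from a relevant document's summary to the query. This is a simplified illustration: in practice you would choose the words or phrases to add yourself, whereas this hypothetical `expand_query` function just takes the first few new words it finds. The stopword list and sample text are also assumptions.

```python
import re

# Small illustrative stopword list (an assumption, not taken from any product).
STOPWORDS = {"a", "an", "and", "by", "in", "of", "on", "the", "to"}

def expand_query(query, relevant_text, max_new_terms=5):
    """Append words from a relevant document that the query does not yet contain."""
    known = set(re.findall(r"[a-z]+", query.lower()))
    new_terms = []
    for word in re.findall(r"[a-z]+", relevant_text.lower()):
        if len(new_terms) >= max_new_terms:
            break
        if word in known or word in STOPWORDS or word in new_terms:
            continue
        new_terms.append(word)
    if not new_terms:
        return query
    return query + ", " + ", ".join(new_terms)

query = "falling tree branches, damage to overhead power lines"

# Hypothetical summary of a relevant document returned by an earlier search.
summary = "Vegetation encroachment damaged the catenary and overhead cables."

expanded = expand_query(query, summary)
print(expanded)
```

Here the summary contributes terms such as “vegetation,” “catenary,” and “cables,” none of which appeared in the original query, so a second search can also match documents written in that vocabulary.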
