Configuring text tokenizers

When a document is indexed, its individual fields pass through analyzing and tokenizing filters that can transform and normalize the field data. For example, a filter might remove blank spaces, strip HTML code, apply stemming, or replace one character with another. A tokenizer splits a text stream into a series of tokens.
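The kind of processing described above can be sketched in plain Java. This is an illustration only, not Sybase Search code: the class name AnalysisSketch, its methods, and the choice to replace underscores with spaces are all assumptions made for the example, and stemming is left out.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; none of these names come from Sybase Search.
final class AnalysisSketch {

    // Replace anything that looks like an HTML tag with a space (crude regex, for illustration).
    static String stripHtml(String s) {
        return s.replaceAll("<[^>]*>", " ");
    }

    // Collapse runs of whitespace into one space and trim both ends.
    static String normalizeSpaces(String s) {
        return s.replaceAll("\\s+", " ").trim();
    }

    // Split the normalized text into tokens on single spaces.
    static List<String> tokenize(String s) {
        List<String> tokens = new ArrayList<>();
        for (String t : s.split(" ")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Apply the filters in order, then tokenize the result.
    static List<String> analyze(String field) {
        String text = stripHtml(field);
        text = text.replace('_', ' ');   // assumed rule: replace one character with another
        text = normalizeSpaces(text);
        return tokenize(text);
    }

    public static void main(String[] args) {
        // prints [Full, text, search, engine]
        System.out.println(analyze("<p>Full_text   <b>search</b> engine</p>"));
    }
}
```

The filters run before tokenization so that the tokenizer sees clean text; a real analysis chain may also apply filters to individual tokens afterward.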

Sybase Search includes several standard text tokenizers. Choose one according to the languages of the documents you index.

Table 3-41: Standard text tokenizer types

Name | Description
--- | ---
com.isdduk.text.parsing.StdTextTokenizer | StdTextTokenizer extends BreakIteratorTextTokenizer and uses java.text.BreakIterator.getWordInstance() to tokenize sentences. Note: StdTextTokenizer is suitable for most Western languages.
com.isdduk.text.parsing.PreScanBitrTokenizer | PreScanBitrTokenizer extends StdTextTokenizer and protects user-defined keywords from being broken up by StdTextTokenizer. Define the keywords in PreScanBitrTokenizer.properties.
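The two approaches in the table can be approximated with the JDK alone. The sketch below is not the source of StdTextTokenizer or PreScanBitrTokenizer: the class WordTokenizerSketch and its methods are hypothetical, the keyword list is passed in directly rather than read from a properties file, and keywords are matched as plain substrings. Only java.text.BreakIterator.getWordInstance() is taken from the descriptions above.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration; not the Sybase Search implementation.
final class WordTokenizerSketch {

    // Word-tokenize with the JDK BreakIterator, keeping only segments
    // that contain at least one letter or digit (drops spaces and punctuation).
    static List<String> words(String text) {
        BreakIterator it = BreakIterator.getWordInstance(Locale.ENGLISH);
        it.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            String segment = text.substring(start, end);
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // Pre-scan for protected keywords: emit each keyword whole and
    // word-tokenize only the text between keyword occurrences.
    static List<String> wordsWithKeywords(String text, List<String> keywords) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int bestIdx = -1;
            String best = null;
            for (String k : keywords) {
                if (k.isEmpty()) {
                    continue; // an empty keyword would never advance the scan
                }
                int idx = text.indexOf(k, pos);
                // Prefer the earliest match; on a tie, prefer the longer keyword.
                if (idx >= 0 && (bestIdx < 0 || idx < bestIdx
                        || (idx == bestIdx && k.length() > best.length()))) {
                    bestIdx = idx;
                    best = k;
                }
            }
            if (best == null) {
                tokens.addAll(words(text.substring(pos)));
                break;
            }
            tokens.addAll(words(text.substring(pos, bestIdx)));
            tokens.add(best);
            pos = bestIdx + best.length();
        }
        return tokens;
    }

    public static void main(String[] args) {
        String text = "Learn C++ today";
        System.out.println(words(text));
        System.out.println(wordsWithKeywords(text, List.of("C++")));
    }
}
```

Comparing the two lines printed by main shows why a pre-scan step is useful: a keyword that contains punctuation, such as "C++", survives as a single token only in the second call.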

To configure the text tokenizers, modify the TextProcessor tag in the TextModule.default.xml file. See “Setting text tokenizer parameters”.