When a document is indexed, its individual fields pass through analyzing and tokenizing filters that can transform and normalize the field data, for example by removing blank spaces, removing HTML code, stemming, or replacing a particular character with another. A tokenizer splits a stream of text into a series of tokens.
Sybase Search includes several standard text tokenizers that you can choose from according to your language requirements.
| Name | Description |
|---|---|
| com.isdduk.text.parsing.StdTextTokenizer | StdTextTokenizer extends BreakIteratorTextTokenizer and uses java.text.BreakIterator.getWordInstance() for tokenizing sentences. |
| com.isdduk.text.parsing.PreScanBitrTokenizer | PreScanBitrTokenizer extends StdTextTokenizer, providing functions that protect user-defined keywords from being destroyed by StdTextTokenizer. Configure the defined keywords in PreScanBitrTokenizer.properties. |
Sybase Search allows you to configure the text tokenizers by modifying the TextProcessor tag in the TextModule.default.xml file. See “Setting text tokenizer parameters”.
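
To illustrate the kind of word-boundary splitting that java.text.BreakIterator.getWordInstance() provides, here is a minimal sketch that uses only the standard JDK. It is not the StdTextTokenizer source: the class name WordTokenizerSketch, the tokenize method, and the rule of discarding segments that contain no letter or digit are assumptions made for this example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration class; not part of Sybase Search.
class WordTokenizerSketch {

    // Splits text at the word boundaries reported by BreakIterator and keeps
    // only the segments that contain at least one letter or digit, so that
    // whitespace and punctuation segments are dropped (an assumed policy).
    static List<String> tokenize(String text, Locale locale) {
        BreakIterator boundaries = BreakIterator.getWordInstance(locale);
        boundaries.setText(text);

        List<String> tokens = new ArrayList<>();
        int start = boundaries.first();
        for (int end = boundaries.next();
             end != BreakIterator.DONE;
             start = end, end = boundaries.next()) {
            String segment = text.substring(start, end);
            if (containsLetterOrDigit(segment)) {
                tokens.add(segment);
            }
        }
        return tokens;
    }

    // Returns true if any character in the segment is a letter or a digit.
    private static boolean containsLetterOrDigit(String segment) {
        for (int i = 0; i < segment.length(); i++) {
            if (Character.isLetterOrDigit(segment.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Sybase Search indexes documents.", Locale.ENGLISH));
    }
}
```

Because BreakIterator places boundaries around punctuation, keywords that contain symbols may be broken into several segments by this kind of splitting, which is the situation the keyword protection in PreScanBitrTokenizer is described as addressing.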