All values for document body text and textual metadata (excluding file paths) are passed through the configured text tokenizers to be broken into individual terms. Each term that is not preserved, is not a stopword, and is neither too short nor too long is passed to the configured term stemmer to be reduced to its root form. Both the text tokenizers and the term stemmer can be reimplemented and reconfigured where necessary.
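This pipeline amounts to a filter chain in front of the stemmer, as the following Java sketch illustrates. The TextTokenizer and TermStemmer interfaces, and the exact treatment of preserved, stopword, and out-of-range terms, are assumptions made for illustration; they are not the actual Sybase Search API.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch of the term pipeline described above.
public class TermPipeline {
    interface TextTokenizer { List<String> tokenize(String text); }
    interface TermStemmer  { String stem(String term); }

    private final TextTokenizer tokenizer;
    private final TermStemmer stemmer;
    private final Set<String> preservedTerms;  // assumed: indexed verbatim, never stemmed
    private final Set<String> stopwords;       // assumed: dropped entirely
    private final int minLength;
    private final int maxLength;

    TermPipeline(TextTokenizer tokenizer, TermStemmer stemmer,
                 Set<String> preservedTerms, Set<String> stopwords,
                 int minLength, int maLength) {
        this.tokenizer = tokenizer;
        this.stemmer = stemmer;
        this.preservedTerms = preservedTerms;
        this.stopwords = stopwords;
        this.minLength = minLength;
        this.maxLength = maLength;
    }

    // Tokenize the text, then stem every term that is not preserved,
    // not a stopword, and neither too short nor too long.
    List<String> process(String text) {
        List<String> terms = new ArrayList<>();
        for (String term : tokenizer.tokenize(text)) {
            if (preservedTerms.contains(term)) {
                terms.add(term);                 // keep as-is
            } else if (!stopwords.contains(term)
                    && term.length() >= minLength
                    && term.length() <= maxLength) {
                terms.add(stemmer.stem(term));   // reduce to root form
            }
            // other terms are dropped (an assumption in this sketch)
        }
        return terms;
    }
}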
Text tokenizing converts extracted plain text into words. Term stemming reduces words to their common roots. Both operations are language-specific; therefore, for optimum performance, when you know that documents and searches will use a single language, you can customize the text tokenizer and term-stemming algorithm for that language.
For example, an English stemming algorithm converts “singing,” “sings,” and “singer” to the stem “sing”; however, this algorithm is not appropriate for French or Chinese.
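As an illustration only, the deliberately naive suffix stripper below handles the three example words; a real English stemmer uses a full algorithm such as Porter stemming rather than this toy suffix list.

// Naive English stemmer for illustration only.
public class NaiveEnglishStemmer {
    public String stem(String term) {
        // Strip a few common English suffixes, longest first.
        for (String suffix : new String[] {"ing", "er", "s"}) {
            if (term.length() > suffix.length() + 2 && term.endsWith(suffix)) {
                return term.substring(0, term.length() - suffix.length());
            }
        }
        return term;
    }

    public static void main(String[] args) {
        NaiveEnglishStemmer stemmer = new NaiveEnglishStemmer();
        for (String word : new String[] {"singing", "sings", "singer"}) {
            System.out.println(word + " -> " + stemmer.stem(word)); // each prints "sing"
        }
    }
}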
The default tokenizer class com.isdduk.text.parsing.StdTextTokenizer handles all double-byte characters by using the underlying default Java class java.text.BreakIterator. The Java BreakIterator class uses punctuation and word delimiters to split single-byte languages into words. For double-byte languages, however, the Java BreakIterator class samples the glyphs (for example, Chinese and Japanese characters) in pairs and tries to determine where word boundaries are likely to fall.
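The following snippet shows the underlying BreakIterator mechanics that a tokenizer of this kind can build on. The filtering step (skipping spans with no letter or digit, such as whitespace and punctuation) is an illustrative choice, not necessarily what StdTextTokenizer does.

import java.text.BreakIterator;
import java.util.Locale;

public class BreakIteratorDemo {
    public static void main(String[] args) {
        // Word iterator for a given locale; boundary analysis is locale-sensitive.
        BreakIterator words = BreakIterator.getWordInstance(Locale.ENGLISH);
        String text = "Text tokenizing converts extracted plain text into words.";
        words.setText(text);

        // Walk the boundary positions; each [start, end) span is a candidate token.
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String token = text.substring(start, end);
            // Skip spans that contain no letter or digit (spaces, punctuation).
            if (token.codePoints().anyMatch(Character::isLetterOrDigit)) {
                System.out.println(token);
            }
        }
    }
}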
If you intend to run Sybase Search with documents in glyph-based languages, Sybase recommends that you write your own custom text tokenizer. A term-splitting algorithm designed for a single language should outperform the Java BreakIterator, which is designed to handle multiple languages.
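One common single-language technique for glyph-based text is greedy longest-match segmentation against a word dictionary. The sketch below is a minimal illustration of that technique under assumed names, not Sybase code; a production tokenizer would use a large lexicon and handle unknown words more carefully.

import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Greedy longest-match segmenter: at each position, take the longest
// dictionary word starting there, falling back to a single glyph.
public class LongestMatchSegmenter {
    private final Set<String> dictionary;
    private final int maxWordLength;

    LongestMatchSegmenter(Set<String> dictionary, int maxWordLength) {
        this.dictionary = dictionary;
        this.maxWordLength = maxWordLength;
    }

    List<String> segment(String text) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            // Try the longest candidate first, shrinking until a
            // dictionary word matches; fall back to one glyph.
            int len = Math.min(maxWordLength, text.length() - i);
            while (len > 1 && !dictionary.contains(text.substring(i, i + len))) {
                len--;
            }
            tokens.add(text.substring(i, i + len));
            i += len;
        }
        return tokens;
    }
}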