Text tokenizers are implemented in pairs, each consisting of a non-stateful and a stateful implementation. The non-stateful tokenizer must define the tokenizing algorithm, and the stateful tokenizer must manage a “tokenizing state” (for example, when tokenizing a series of contiguous character buffers, where a token can span the boundary between two buffers). The tokenizers are defined by the following interfaces:
com.isdduk.text.parsing.TextTokenizer
com.isdduk.text.parsing.StatefulTextTokenizer
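The split between the two halves of a pair can be illustrated with a minimal sketch. This section does not show the method signatures of the Sybase Search interfaces, so the types below (SimpleTokenizer, StatefulSimpleTokenizer, WHITESPACE, tokenizeBuffers) are hypothetical stand-ins invented for illustration and are not part of com.isdduk.text.parsing. The sketch assumes a whitespace-delimited algorithm and shows why a state is needed when a token spans two contiguous buffers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch only: none of these types belong to com.isdduk.text.parsing.
final class TokenizerPairSketch {

    /** Non-stateful half: defines the algorithm; all context is passed in. */
    interface SimpleTokenizer {
        /**
         * Appends every complete token found in carry + buffer to out and
         * returns the trailing text that may still be an incomplete token.
         */
        String tokenize(String carry, CharSequence buffer, List<String> out);
    }

    /** Stateful half: remembers the unfinished token between buffers. */
    static final class StatefulSimpleTokenizer {
        private final SimpleTokenizer algorithm;
        private String carry = "";

        StatefulSimpleTokenizer(SimpleTokenizer algorithm) {
            this.algorithm = algorithm;
        }

        /** Tokenizes the next contiguous buffer, keeping any partial token. */
        void feed(CharSequence buffer, List<String> out) {
            carry = algorithm.tokenize(carry, buffer, out);
        }

        /** Flushes the last partial token once no more buffers will arrive. */
        void finish(List<String> out) {
            if (!carry.isEmpty()) {
                out.add(carry);
            }
            carry = "";
        }
    }

    /** A whitespace-delimited algorithm, chosen only to keep the sketch short. */
    static final SimpleTokenizer WHITESPACE = (carry, buffer, out) -> {
        StringBuilder current = new StringBuilder(carry);
        for (int i = 0; i < buffer.length(); i++) {
            char c = buffer.charAt(i);
            if (Character.isWhitespace(c)) {
                if (current.length() > 0) {
                    out.add(current.toString());
                    current.setLength(0);
                }
            } else {
                current.append(c);
            }
        }
        return current.toString();
    };

    /** Convenience helper: runs the stateful tokenizer over several buffers. */
    static List<String> tokenizeBuffers(String... buffers) {
        StatefulSimpleTokenizer tokenizer = new StatefulSimpleTokenizer(WHITESPACE);
        List<String> out = new ArrayList<>();
        for (String buffer : buffers) {
            tokenizer.feed(buffer, out);
        }
        tokenizer.finish(out);
        return out;
    }

    public static void main(String[] args) {
        // "search" is split across the two buffers and is reassembled by the state.
        System.out.println(tokenizeBuffers("full text sea", "rch engine"));
        // prints [full, text, search, engine]
    }
}
```

In this sketch the algorithm object holds no fields, so one instance could be shared freely, while each stream of buffers would need its own stateful wrapper.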
Sybase Search provides an abstract base class for each interface, to simplify the implementation and integration of new tokenizer classes:
com.isdduk.text.parsing.AbstractTextTokenizer
com.isdduk.text.parsing.AbstractStatefulTextTokenizer
Sybase Search also provides additional support for two common tokenization techniques:
tokenizing strings using a java.text.BreakIterator
tokenizing strings into an array of strings (java.lang.String[])
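The two techniques listed above can be sketched with the standard JDK alone. The class and method names here (TokenizationTechniquesSketch, withBreakIterator, intoStringArray) are hypothetical and do not come from Sybase Search; only java.text.BreakIterator and String.split are real JDK APIs. The choice to keep only segments that start with a letter or digit, and to split on whitespace, are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch only: not part of com.isdduk.text.parsing.
final class TokenizationTechniquesSketch {

    /** Technique 1: walk locale-aware word boundaries with a BreakIterator. */
    static List<String> withBreakIterator(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        int end = words.next();
        while (end != BreakIterator.DONE) {
            String candidate = text.substring(start, end);
            // Boundaries also delimit runs of spaces and punctuation;
            // this sketch assumes only letter or digit segments are tokens.
            if (Character.isLetterOrDigit(candidate.codePointAt(0))) {
                tokens.add(candidate);
            }
            start = end;
            end = words.next();
        }
        return tokens;
    }

    /** Technique 2: produce the tokens directly as a String[]. */
    static String[] intoStringArray(String text) {
        String trimmed = text.trim();
        // Splitting an empty string would yield one empty token, so guard it.
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        System.out.println(withBreakIterator("Sybase Search indexes documents.", Locale.ENGLISH));
        System.out.println(Arrays.toString(intoStringArray("  full  text search ")));
    }
}
```

A BreakIterator suits languages where whitespace is not a reliable word delimiter, because the boundary rules depend on the locale; the array form is simpler when a plain delimiter is enough.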
For each of these techniques, there are additional sub-interfaces and sub-classes:
com.isdduk.text.parsing.BreakIteratorTextTokenizer
com.isdduk.text.parsing.BitrStatefulTextTokenizer
com.isdduk.text.parsing.StringArrayTextTokenizer
com.isdduk.text.parsing.StrAryStatefulTextTokenizer
This section explains how developers can build customized text tokenizers to suit their search requirements and language specifications.