Developing custom text tokenizers

Text tokenizers are implemented in pairs, consisting of a non-stateful and a stateful implementation. The non-stateful tokenizer must define the tokenizing algorithm, and the stateful tokenizer must manage a “tokenizing state” (for example, when tokenizing a series of contiguous character buffers). The tokenizers are defined in the following interfaces:

com.isdduk.text.parsing.TextTokenizer
com.isdduk.text.parsing.StatefulTextTokenizer
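
The pairing described above can be illustrated with a minimal sketch. The names below (SketchTokenizer, WhitespaceSketchTokenizer, StatefulSketchTokenizer, feed, finish) are hypothetical and are not the method signatures of the com.isdduk.text.parsing interfaces. The sketch only shows the idea, under the assumption that tokens are separated by whitespace: the non-stateful class holds the algorithm, and the stateful class holds back a token that may continue in the next character buffer.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical non-stateful tokenizer: defines only the tokenizing algorithm. */
interface SketchTokenizer {
    List<String> tokenize(CharSequence text);
}

/** Assumed algorithm for this sketch: split on whitespace. */
final class WhitespaceSketchTokenizer implements SketchTokenizer {
    @Override
    public List<String> tokenize(CharSequence text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if none
        for (int i = 0; i < text.length(); i++) {
            if (Character.isWhitespace(text.charAt(i))) {
                if (start >= 0) {
                    tokens.add(text.subSequence(start, i).toString());
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            tokens.add(text.subSequence(start, text.length()).toString());
        }
        return tokens;
    }
}

/** Hypothetical stateful tokenizer: carries an unfinished token across buffers. */
final class StatefulSketchTokenizer {
    private final SketchTokenizer delegate;
    private final StringBuilder pending = new StringBuilder(); // the tokenizing state

    StatefulSketchTokenizer(SketchTokenizer delegate) {
        this.delegate = delegate;
    }

    /** Tokenizes the next buffer, holding back a trailing token that may continue. */
    List<String> feed(CharSequence buffer) {
        pending.append(buffer);
        int cut = pending.length();
        // Move the cut point back to just after the last whitespace character.
        while (cut > 0 && !Character.isWhitespace(pending.charAt(cut - 1))) {
            cut--;
        }
        List<String> tokens = delegate.tokenize(pending.subSequence(0, cut));
        pending.delete(0, cut);
        return tokens;
    }

    /** Flushes whatever is still held back once no more buffers will arrive. */
    List<String> finish() {
        List<String> tokens = delegate.tokenize(pending);
        pending.setLength(0);
        return tokens;
    }
}
```

In this sketch, feeding "full-te" returns no tokens because the word may not be complete, and feeding "xt search " afterwards returns "full-text" and "search". Without the held-back state, the word split across the two buffers would be emitted as two separate tokens.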

Sybase Search provides an abstract base class for each interface, to simplify the implementation and integration of new tokenizer classes:

com.isdduk.text.parsing.AbstractTextTokenizer
com.isdduk.text.parsing.AbstractStatefulTextTokenizer

Sybase Search also provides additional support for two common tokenization techniques:

tokenizing strings using a java.text.BreakIterator
tokenizing strings into an array of strings (java.lang.String[])
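
As a rough illustration of the two techniques, the following sketch uses only standard JDK classes (the BreakIterator class from the java.text package, and String.split). The class and method names (TokenizingTechniquesSketch, withBreakIterator, intoStringArray) are hypothetical; the sketch does not use or reproduce the Sybase Search sub-interfaces, and the choices to skip punctuation and to split on whitespace are assumptions made for the example.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class TokenizingTechniquesSketch {

    /** Technique 1: walk the word boundaries that a BreakIterator reports. */
    static String[] withBreakIterator(String text, Locale locale) {
        BreakIterator words = BreakIterator.getWordInstance(locale);
        words.setText(text);
        List<String> tokens = new ArrayList<>();
        int start = words.first();
        for (int end = words.next(); end != BreakIterator.DONE; start = end, end = words.next()) {
            String segment = text.substring(start, end);
            // Boundaries also delimit spaces and punctuation; keep only real words.
            if (segment.codePoints().anyMatch(Character::isLetterOrDigit)) {
                tokens.add(segment);
            }
        }
        return tokens.toArray(new String[0]);
    }

    /** Technique 2: split the input directly into an array of strings. */
    static String[] intoStringArray(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return new String[0]; // split would otherwise return one empty string
        }
        return trimmed.split("\\s+");
    }

    public static void main(String[] args) {
        String sample = "Hello, world";
        System.out.println(String.join("|", withBreakIterator(sample, Locale.ENGLISH)));
        System.out.println(String.join("|", intoStringArray(sample)));
    }
}
```

The BreakIterator approach is locale-aware and separates punctuation from words, so "Hello, world" yields "Hello" and "world". The whitespace split is simpler but leaves punctuation attached to the neighboring token.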

For each of these techniques, there are additional sub-interfaces and sub-classes:

com.isdduk.text.parsing.BreakIteratorTextTokenizer
com.isdduk.text.parsing.BitrStatefulTextTokenizer
com.isdduk.text.parsing.StringArrayTextTokenizer
com.isdduk.text.parsing.StrAryStatefulTextTokenizer

This section explains how developers can create custom text tokenizers to suit their search requirements and language specification.