Developing custom text tokenizers

Text tokenizers are implemented in pairs: a non-stateful implementation and a stateful implementation. The non-stateful tokenizer must define the tokenizing algorithm, and the stateful tokenizer must manage a “tokenizing state” (for example, when tokenizing a series of contiguous character buffers). The tokenizers are defined in the following interfaces:

Sybase Search provides an abstract base class for each interface, to simplify the implementation and integration of new tokenizer classes:

Sybase Search also provides support for two common tokenization techniques:

For each of these techniques, there are additional sub-interfaces and sub-classes:

This section explains how developers can build custom text tokenizers to suit their search requirements and language specifications.
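The pairing of a non-stateful and a stateful tokenizer described above can be illustrated with a minimal Java sketch. It does not use the Sybase Search API: the names `TokenizingAlgorithm`, `WhitespaceAlgorithm`, `BufferedTokenizer`, `feed`, and `finish` are hypothetical and chosen only for illustration. The sketch assumes the simplest possible algorithm (split on whitespace) and shows why a tokenizing state is needed when a token spans two contiguous character buffers.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical non-stateful part: defines the tokenizing algorithm only.
// It keeps no fields, so one instance can be shared freely.
interface TokenizingAlgorithm {
    boolean isDelimiter(char c);
}

// Assumed example algorithm: tokens are separated by whitespace.
final class WhitespaceAlgorithm implements TokenizingAlgorithm {
    @Override
    public boolean isDelimiter(char c) {
        return Character.isWhitespace(c);
    }
}

// Hypothetical stateful part: applies the algorithm across a series of
// contiguous buffers and remembers any token cut off at a buffer boundary.
final class BufferedTokenizer {
    private final TokenizingAlgorithm algorithm;
    // The "tokenizing state": characters of a token that is not yet complete.
    private final StringBuilder pending = new StringBuilder();

    BufferedTokenizer(TokenizingAlgorithm algorithm) {
        this.algorithm = algorithm;
    }

    // Consumes one buffer and returns the tokens completed within it.
    List<String> feed(CharSequence buffer) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < buffer.length(); i++) {
            char c = buffer.charAt(i);
            if (algorithm.isDelimiter(c)) {
                flushInto(tokens);
            } else {
                pending.append(c);
            }
        }
        return tokens;
    }

    // Signals end of input and returns the final token, if one is pending.
    List<String> finish() {
        List<String> tokens = new ArrayList<>();
        flushInto(tokens);
        return tokens;
    }

    private void flushInto(List<String> tokens) {
        if (pending.length() > 0) {
            tokens.add(pending.toString());
            pending.setLength(0);
        }
    }
}
```

With this sketch, feeding the buffers `"custom tok"` and `"enizers here"` in sequence yields `custom` from the first call and `tokenizers` from the second, because the fragment `tok` is held in the state between calls; `finish()` then returns `here`. A real implementation would build on the interfaces and abstract base classes that Sybase Search provides rather than on these illustrative types.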