Parsers process metadata values, which are generally received as string key/value pairs. Although document body text is processed by the system term splitter and stemmer, metadata often must be handled differently, because metadata values can be numeric or date types as well as strings. The parsers loaded by the Text Manager are referenced in the metadata field parser and query parser XML configuration files.
There are four types of parsers:
String
Numeric decimal
Numeric integer
Date (time)
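The following Java fragment is a minimal sketch of why the four types matter: the same string key/value pairs yield differently typed values depending on the field. It uses only the standard Java library and is not the Sybase Search parser API. The field names, the `MetadataTypeSketch` class, and the `dd/MM/yyyy` date pattern are assumptions made for illustration.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Illustrative only: a hypothetical field-to-converter mapping, not the product API.
final class MetadataTypeSketch {

    // Assumed date pattern for the example; the real parsers are configured elsewhere.
    static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("dd/MM/yyyy");

    // Hypothetical field names, one per parser type.
    static final Map<String, Function<String, Object>> FIELD_PARSERS = new LinkedHashMap<>();

    static {
        FIELD_PARSERS.put("title", s -> s);                                   // string
        FIELD_PARSERS.put("score", s -> Float.valueOf(s));                    // numeric decimal
        FIELD_PARSERS.put("pages",
                s -> Integer.valueOf((int) Double.parseDouble(s)));           // numeric integer, fraction discarded
        FIELD_PARSERS.put("created", s -> LocalDate.parse(s, DATE));          // date
    }

    // Converts one raw metadata value; unknown fields stay as strings.
    static Object parse(String field, String raw) {
        Function<String, Object> parser = FIELD_PARSERS.getOrDefault(field, s -> s);
        return parser.apply(raw.trim());
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Annual report");
        metadata.put("score", "3.142");
        metadata.put("pages", "3.142");
        metadata.put("created", "02/06/2003");

        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            Object value = parse(entry.getKey(), entry.getValue());
            System.out.println(entry.getKey() + " -> " + value
                    + " (" + value.getClass().getSimpleName() + ")");
        }
    }
}
```

Keeping the conversion in a per-field lookup mirrors the idea of referencing a parser from a field configuration, although the actual configuration files are XML.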
You can build custom parsers and plug them into the system if necessary. Table 3-23 shows the attributes for the Parser tag.
Attribute | Default | Description |
---|---|---|
identifier | None | The Parser instance’s identifier. This must be a name and a unique ID separated by an underscore (_). |
class | None | The Java implementation class. |
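The identifier rule can be checked mechanically. The sketch below is one possible reading of it, not product code: it assumes the identifier is split at its last underscore, that both parts must be non-empty, and that uniqueness applies to the ID part. The `ParserIdentifierSketch` class and its methods are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: validates identifiers of the form <name>_<id>.
final class ParserIdentifierSketch {

    // Splits an identifier at its last underscore and returns {name, id}.
    static String[] split(String identifier) {
        int idx = identifier.lastIndexOf('_');
        if (idx <= 0 || idx == identifier.length() - 1) {
            throw new IllegalArgumentException("Expected <name>_<id>: " + identifier);
        }
        return new String[] { identifier.substring(0, idx), identifier.substring(idx + 1) };
    }

    // Returns true when every identifier is well formed and no ID part repeats.
    static boolean allUnique(List<String> identifiers) {
        Set<String> seenIds = new HashSet<>();
        for (String identifier : identifiers) {
            if (!seenIds.add(split(identifier)[1])) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] parts = split("dateMs1970_4");
        System.out.println("name=" + parts[0] + ", id=" + parts[1]);
        System.out.println(allUnique(Arrays.asList("float_1", "integer_2", "dateUK_3")));
    }
}
```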
Table 3-24 shows the attributes for the Param tag.
Attribute | Default | Description |
---|---|---|
name | None | The name of the parameter to pass to the parser. |
value | None | The string value to associate with the parameter name. |
Sybase Search comes with the preconfigured parsers shown in Table 3-25, which are adequate for most common metadata types.
Item | Description |
---|---|
Name | float_1 |
Class | com.isdduk.text.SimpleFloatParser – This class parses strings representing decimal numbers into actual decimal numbers. For example, the string “3.142” is parsed into Java float 3.142. |
Name | integer_2 |
Class | com.isdduk.text.IntegerParser – This class parses strings representing an integer number into an actual integer number; any floating-point information is discarded. For example, both “3” and “3.142” are parsed into Java int 3. |
Name | dateUK_3 |
Class | com.isdduk.text.DateFormatParser |
Name | dateMs1970_4 |
Class | com.isdduk.text.Ms1970DateParser |
Parameter | Name – roundTo. Value – choose year, month, day, hour, minute, or second; any other value denotes that no rounding should take place. This class is a date parser that parses strings representing long integer (64-bit) numbers, which themselves represent dates as the number of milliseconds since 1 January 1970. The preconfigured instance rounds dates to the nearest day (UTC). |
Name | intB2KB_5 |
Class | com.isdduk.text.B2KBIntParser – This class parses strings representing byte-size numbers and converts them into kilobyte-size numbers. For example, the string “2048” (bytes) is parsed as Java int 2 (kilobytes). |
Name | datePDF_6 |
Class | com.isdduk.text.PDFDateParser |
Parameter | Name – roundTo. Value – choose year, month, day, hour, minute, or second; any other value denotes that no rounding should take place. This class handles the PDF date format, in which dates are formatted “D:20030602143803+01'00'”. The preconfigured instance rounds dates to the nearest day (UTC). |
Name | url_7 |
Class | com.isdduk.text.URLTermParser – This class splits URL strings into their constituent elements, namely, protocol, host, port, path, extension, and query. Optionally, each element can be indexed separately. The options parameter of the parser determines the elements that the parser returns. Not all elements are indexed by default. For example, the protocol and port elements are not indexed by default because their values are usually the same for all URLs (http and 80, respectively). Thus, these values are typically not important in URL matching. |
Parameter | Name – options. Value – the sum of the bit values that represent the elements the URL parser should return: HOST – 2, PORT – 4, PATH – 8, EXTENSION – 16, QUERY – 32. For example, if the URL parser should return only the path and extension URL elements, set the options parameter to 24 (8+16). With this setting, parsing the URL http://www.sybase.com/about/jobs.html returns “about”, “jobs”, and “html”. |
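The way the options value selects URL elements can be sketched in Java as follows. This is an approximation built on `java.net.URI`, not the URLTermParser implementation. The bit values come from Table 3-25; the `UrlOptionsSketch` class, the splitting of the path on “/”, and the treatment of the host as a single term are assumptions. The table lists no bit value for the protocol element, so the sketch omits it.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: selects URL elements with a bitmask, as the options parameter does.
final class UrlOptionsSketch {

    // Bit values as listed for the options parameter.
    static final int HOST = 2;
    static final int PORT = 4;
    static final int PATH = 8;
    static final int EXTENSION = 16;
    static final int QUERY = 32;

    // Returns the terms selected by the options bitmask (assumed splitting rules).
    static List<String> terms(String url, int options) {
        URI uri = URI.create(url);
        List<String> out = new ArrayList<>();

        // Separate a trailing file extension from the path, if there is one.
        String path = uri.getPath() == null ? "" : uri.getPath();
        String extension = "";
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot > slash) {
            extension = path.substring(dot + 1);
            path = path.substring(0, dot);
        }

        if ((options & HOST) != 0 && uri.getHost() != null) {
            out.add(uri.getHost());
        }
        if ((options & PORT) != 0 && uri.getPort() != -1) {
            out.add(String.valueOf(uri.getPort()));
        }
        if ((options & PATH) != 0) {
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) {
                    out.add(segment);
                }
            }
        }
        if ((options & EXTENSION) != 0 && !extension.isEmpty()) {
            out.add(extension);
        }
        if ((options & QUERY) != 0 && uri.getQuery() != null) {
            out.add(uri.getQuery());
        }
        return out;
    }

    public static void main(String[] args) {
        int options = PATH | EXTENSION; // 8 + 16 = 24
        // prints [about, jobs, html]
        System.out.println(terms("http://www.sybase.com/about/jobs.html", options));
    }
}
```

Because each element has its own bit, any combination is a plain sum: for example, host plus extension would be 18 (2+16).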