Metadata parsers

Metadata parsers process metadata values, which are received as strings. Document body text is processed by the system text tokenizer and stemmer, but metadata must often be handled differently, because a metadata string value can represent a number or a date.

There are four types of metadata parsers:

Sybase Search includes these preconfigured metadata parsers. Each requires an identifier that consists of two parts: a name and a unique ID.

Table 2-7: Preconfigured parsers

Item

Description

Name

float_1

Class

com.isdduk.text.SimpleFloatParser

This class parses strings representing decimal numbers into actual decimal numbers. For example, this parser processes the string “3.142” into Java float 3.142.

Name

integer_2

Class

com.isdduk.text.IntegerParser

This class parses strings representing an integer number into an actual integer number; any floating-point information is discarded. For example, this parser processes both “3” and “3.142” into Java int 3.
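The difference between the float and integer parsers can be sketched as follows. This is a minimal illustration of the documented behavior, not the product's implementation: the class `NumericMetadataSketch` and its methods are hypothetical names, and truncation toward zero for negative values is an assumption.

```java
import java.math.BigDecimal;

// Hypothetical sketch; not the com.isdduk.text implementation.
final class NumericMetadataSketch {

    // Mirrors the documented float_1 behavior: "3.142" becomes float 3.142.
    static float toFloat(String raw) {
        return Float.parseFloat(raw.trim());
    }

    // Mirrors the documented integer_2 behavior: the fractional part is discarded,
    // so both "3" and "3.142" become int 3. BigDecimal.intValue() drops the fraction
    // (assumption: negative values truncate toward zero in the same way).
    static int toInt(String raw) {
        return new BigDecimal(raw.trim()).intValue();
    }

    public static void main(String[] args) {
        System.out.println(toFloat("3.142"));
        System.out.println(toInt("3"));
        System.out.println(toInt("3.142"));
    }
}
```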

Name

dateUK_3

Class

com.isdduk.text.DateFormatParser

Name

dateMs1970_4

Class

com.isdduk.text.Ms1970DateParser

Parameter

roundTo

Value – choose year, month, day, hour, minute, or second. Any other value indicates that no rounding should take place.

This class is a date parser, which parses strings representing long integer (64-bit) numbers, which themselves represent dates as the number of milliseconds since 1 January 1970. The preconfigured instance rounds dates to the nearest day (Coordinated Universal Time).
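As a rough illustration of what the preconfigured instance does, the sketch below parses a millisecond timestamp string and rounds it to the nearest UTC day. The class and method names are hypothetical, and the tie-breaking rule (a value exactly at noon rounds up to the next day) is an assumption; the product's behavior for other roundTo values is not modeled here.

```java
import java.time.Duration;
import java.time.Instant;
import java.time.temporal.ChronoUnit;

// Hypothetical sketch; not the com.isdduk.text.Ms1970DateParser implementation.
final class Ms1970DaySketch {

    // Parses a string holding milliseconds since 1 January 1970 (UTC)
    // and rounds the result to the nearest UTC day.
    static Instant parseAndRoundToDay(String raw) {
        Instant instant = Instant.ofEpochMilli(Long.parseLong(raw.trim()));
        // Adding half a day and then truncating gives round-to-nearest
        // (assumption: exactly 12:00 rounds up).
        return instant.plus(Duration.ofHours(12)).truncatedTo(ChronoUnit.DAYS);
    }

    public static void main(String[] args) {
        System.out.println(parseAndRoundToDay("43199999")); // just before noon on 1 January 1970
        System.out.println(parseAndRoundToDay("43200000")); // exactly noon on 1 January 1970
    }
}
```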

Name

intB2KB_5

Class

com.isdduk.text.B2KBIntParser

This class parses strings representing byte-size numbers and converts them into kilobyte-sized numbers. For example, the string “2048” (bytes) is parsed as Java int 2 (kilobytes).

Name

datePDF_6

Class

com.isdduk.text.PDFDateParser

Parameter

roundTo

Value – choose year, month, day, hour, minute, or second. Any other value indicates that no rounding should take place.

This class handles the PDF date format, in which dates are formatted “D:20030602143803+01'00'”. The preconfigured instance rounds dates to the nearest day (UTC).
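To show how a string in this format maps to a point in time, here is a minimal sketch that handles only the full form shown above (date, time, and offset all present). The class `PdfDateSketch` is a hypothetical name, rounding is not applied, and shorter variants of the PDF date format are deliberately rejected.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch; not the com.isdduk.text.PDFDateParser implementation.
final class PdfDateSketch {

    // Matches only the full form, for example D:20030602143803+01'00'
    private static final Pattern FULL_FORM = Pattern.compile(
        "D:(\\d{4})(\\d{2})(\\d{2})(\\d{2})(\\d{2})(\\d{2})([+-])(\\d{2})'(\\d{2})'");

    static Instant parse(String raw) {
        Matcher m = FULL_FORM.matcher(raw.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported PDF date: " + raw);
        }
        LocalDateTime local = LocalDateTime.of(
            Integer.parseInt(m.group(1)), Integer.parseInt(m.group(2)),
            Integer.parseInt(m.group(3)), Integer.parseInt(m.group(4)),
            Integer.parseInt(m.group(5)), Integer.parseInt(m.group(6)));
        // Hours and minutes of the offset must carry the same sign.
        int sign = "-".equals(m.group(7)) ? -1 : 1;
        ZoneOffset offset = ZoneOffset.ofHoursMinutes(
            sign * Integer.parseInt(m.group(8)), sign * Integer.parseInt(m.group(9)));
        return local.toInstant(offset);
    }

    public static void main(String[] args) {
        System.out.println(parse("D:20030602143803+01'00'"));
    }
}
```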

Name

url_7

Class

com.isdduk.text.URLTermParser

This class splits URL strings into their constituent elements: protocol, host, port, path, extension, and query. Optionally, each element can be indexed separately. The options parameter determines the elements that the parser returns. By default, not all elements are indexed. For example, the values of the protocol and port elements (http and 80, respectively) are usually the same for all URLs and are therefore not indexed by default.

Parameter

options

Value – choose the value that is the sum of the bits that represent the elements the URL parser should return:

  • PROTOCOL – 1

  • HOST – 2

  • PORT – 4

  • PATH – 8

  • EXTENSION – 16

  • QUERY – 32

For example, for the parser to return the path and extension URL elements, set the options parameter to 24 (8+16). If you then use this setting to parse, for example, http://www.mysite.com/about/jobs.html, the parser returns “about”, “jobs”, and “html”.
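The way the options value selects elements can be sketched with a bitmask test. This is only an illustration built on java.net.URI: the class `UrlOptionsSketch` is a hypothetical name, the constants reuse the bit values listed above, and returning the host and query as single unsplit terms is an assumption about tokenization.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; not the com.isdduk.text.URLTermParser implementation.
final class UrlOptionsSketch {
    static final int PROTOCOL = 1, HOST = 2, PORT = 4, PATH = 8, EXTENSION = 16, QUERY = 32;

    // Returns only the URL elements whose bit is set in options.
    static List<String> parse(String url, int options) {
        URI uri = URI.create(url);
        List<String> terms = new ArrayList<>();
        if ((options & PROTOCOL) != 0 && uri.getScheme() != null) terms.add(uri.getScheme());
        if ((options & HOST) != 0 && uri.getHost() != null) terms.add(uri.getHost());
        // getPort() returns -1 when the URL has no explicit port.
        if ((options & PORT) != 0 && uri.getPort() != -1) terms.add(String.valueOf(uri.getPort()));

        // Separate the extension from the last path segment, if there is one.
        String path = uri.getPath() == null ? "" : uri.getPath();
        String extension = null;
        int slash = path.lastIndexOf('/');
        int dot = path.lastIndexOf('.');
        if (dot > slash) {
            extension = path.substring(dot + 1);
            path = path.substring(0, dot);
        }
        if ((options & PATH) != 0) {
            for (String segment : path.split("/")) {
                if (!segment.isEmpty()) terms.add(segment);
            }
        }
        if ((options & EXTENSION) != 0 && extension != null && !extension.isEmpty()) terms.add(extension);
        if ((options & QUERY) != 0 && uri.getQuery() != null) terms.add(uri.getQuery());
        return terms;
    }

    public static void main(String[] args) {
        int options = PATH | EXTENSION; // 8 + 16 = 24
        System.out.println(parse("http://www.mysite.com/about/jobs.html", options));
    }
}
```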

Name

int2int

Class

com.isdduk.text.Int2IntParser

This class parses strings representing integer numbers and then applies an arithmetic operator and a factor to the integer value.

Parameter

  • operator

    Value – can be any of these values:

    • +

    • -

    • *

    • /

  • factor

    Value – an integer that is applied to the original integer using the chosen operator

    For example, if the operator value is “/” and the factor value is “1024”, the result is similar to that of the B2KBIntParser parser.
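The operator and factor parameters can be pictured with the following sketch. The class `Int2IntSketch` and its method are hypothetical names, and the use of Java integer division for “/” is an assumption; the product's rounding behavior for division is not documented here.

```java
// Hypothetical sketch; not the com.isdduk.text.Int2IntParser implementation.
final class Int2IntSketch {

    // Parses the string as an integer and applies the operator with the factor.
    static int apply(String raw, String operator, int factor) {
        int value = Integer.parseInt(raw.trim());
        switch (operator) {
            case "+": return value + factor;
            case "-": return value - factor;
            case "*": return value * factor;
            case "/": return value / factor; // integer division (assumption)
            default:
                throw new IllegalArgumentException("Unsupported operator: " + operator);
        }
    }

    public static void main(String[] args) {
        // Operator "/" with factor 1024 converts a byte count to kilobytes.
        System.out.println(apply("2048", "/", 1024));
    }
}
```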

Adding new metadata parsers

You can create new metadata parsers. The system generates a unique integer ID for each new parser; this ID forms part of the parser identifier.

  1. Click Configuration.

  2. Click Metadata Parsers.

  3. Click Add a new metadata parser.

  4. Complete these fields:

    • Parser Name – name of the parser instance.

    • Implement Class – Java implementation class.

  5. If your metadata parser requires special parameters, click Add and complete these fields; otherwise, proceed to step 6.

    • Name – the name of the parameter to pass to the parser.

    • Value – the string value to associate with the parameter name.

  6. Click Create.

Editing metadata parsers

You can edit a metadata parser only if it is not in use anywhere, including in query parsers and metadata fields that reference it.

  1. From the Metadata Configuration Summary page, click Metadata Parsers.

  2. Click Edit for the parser that you want to change.

  3. Make the changes and click Save Changes.

Removing metadata parsers

You can remove a metadata parser only if it is not in use anywhere, including in query parsers and metadata fields that reference it.

  1. From the Metadata Configuration Summary page, click Metadata Parsers.

  2. Click Remove for the parser that you want to delete.

  3. Click OK to confirm the removal.