Web robots

Web robots create and manage their own passive document store. A Web robot crawls the specified Web sites and saves the text content of each page locally. However, it does not save Web content such as images, JavaScript, and style sheets.

Note: Web robots may take a considerable time to crawl the target Web sites. Sybase recommends that you configure only one robot per Web site.

Steps: Creating a Web robot

  1. Click Document Management.

  2. Click Web Robots.

  3. Click Import from the Web.

  4. Complete these fields:

    Field

    Description

    Main page

    Name

    Name of the Web robot.

    Crawl Now

    Indicates whether the Web robot should begin crawling immediately, or wait until it is scheduled or manually started later.

    Force Refresh

    Indicates whether the Web robot should discard the previously collected URL data.

When a Web robot crawls a Web site, it stores the HTTP status code and some of the response headers for each page it downloads, such as the Expires, Last-Modified, and ETag headers. This information helps determine whether a page must be downloaded and crawled again. An illustrative sketch of this conditional re-download follows the Main page fields below.

Force Refresh is selected by default when you edit the Web robot.

    Web Robot Manager

    Indicates the Web robot manager that hosts the Web robot.

    Passive Document Store Manager

    Indicates the Document store manager to which the Web robot should send its documents for indexing.

    Store Indexed Text

Indicates whether the raw text from each document is stored within the document store. By default, the option is selected. If you unselect the option, the search results page does not include the View Text link for each result, because there is no cached text to display.
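    The stored status code and response headers described under Force Refresh are what allow a repeat crawl to skip unchanged pages. The following sketch is not Sybase Search code; it is a minimal illustration, using java.net.HttpURLConnection, of how a crawler can send saved Last-Modified and ETag values back to the server as a conditional request. The URL and header values are placeholder assumptions.

        import java.net.HttpURLConnection;
        import java.net.URL;

        public class ConditionalFetch {
            public static void main(String[] args) throws Exception {
                // Values assumed to have been saved from a previous crawl of this page.
                String savedLastModified = "Tue, 01 Apr 2008 10:00:00 GMT";
                String savedEtag = "\"abc123\"";

                HttpURLConnection conn = (HttpURLConnection)
                        new URL("http://example.net/index.html").openConnection();
                // Ask the server to return the page only if it changed since the last crawl.
                conn.setRequestProperty("If-Modified-Since", savedLastModified);
                conn.setRequestProperty("If-None-Match", savedEtag);

                int status = conn.getResponseCode();
                if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {
                    // 304: the cached text is still current; nothing to re-index.
                    System.out.println("Page unchanged; skipping");
                } else if (status == HttpURLConnection.HTTP_OK) {
                    // 200: download the page again and save the new headers for next time.
                    System.out.println("Page changed; re-downloading");
                }
                conn.disconnect();
            }
        }

    Selecting Force Refresh corresponds to discarding the saved values, so every page is downloaded and indexed again.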

    URLs page

    Start URLs

    Indicates the URLs where the Web robot starts crawling.

    Link Extractor Patterns

Indicates the patterns that control link extraction: links in pages downloaded from URLs that match one of these patterns are extracted and put into the URL (work) queue.

    Regular Expressions

Indicates whether the patterns should be treated as Java 1.5 regular expressions. A regular expression pattern follows a set of syntax rules to describe or match a set of strings. For the supported syntax, see the java.util.regex.Pattern documentation on the Java API Web site. A sketch contrasting regular expression and nonregular expression matching follows the URLs page fields below.

If this option is not selected, patterns are treated as nonregular expressions. Nonregular expression patterns that begin with http:// or https:// are treated as “starts with” patterns; all other nonregular expression patterns are treated as “contains string” patterns. For example:

    • http://www.mysite.net – extracts links from all pages.

• http://www.mysite.net/public/ – extracts links only from pages in the /public directory.

    • /public/ – extracts all links that include “/public/” as part of their URL.

    Link Extractor Pattern Exceptions

    Indicates the exceptions to the general rules specified in Link Extractor Patterns.

Index Patterns

    Indicates that the pages downloaded from URLs that match one of these patterns are indexed.

    Index Pattern Exceptions

    Indicates the exceptions to the general rules specified in Index Patterns.
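    The link extractor and index patterns above can be evaluated in two ways, depending on the Regular Expressions option. The following sketch is an illustration only, not the product’s matcher: it contrasts a Java regular expression (java.util.regex.Pattern) with the nonregular expression “starts with” and “contains string” semantics described under Regular Expressions. Whether the product requires a full or partial regular expression match is not stated here, so the sketch assumes a full match; the URL and patterns are examples.

        import java.util.regex.Pattern;

        public class PatternCheck {
            // Nonregular expression semantics: patterns that begin with http:// or
            // https:// are "starts with" patterns; all others are "contains string".
            static boolean matchesPlain(String pattern, String url) {
                if (pattern.startsWith("http://") || pattern.startsWith("https://")) {
                    return url.startsWith(pattern);
                }
                return url.contains(pattern);
            }

            // Regular expression semantics (assumed here to require a full match).
            static boolean matchesRegex(String pattern, String url) {
                return Pattern.matches(pattern, url);
            }

            public static void main(String[] args) {
                String url = "http://www.mysite.net/public/docs/page1.html";

                // "Starts with" pattern: true for any page under /public/.
                System.out.println(matchesPlain("http://www.mysite.net/public/", url));
                // "Contains string" pattern: true because the URL includes /public/.
                System.out.println(matchesPlain("/public/", url));
                // Regular expression: any .html page on the site.
                System.out.println(matchesRegex("http://www\\.mysite\\.net/.*\\.html", url));
            }
        }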

    User Agent page

    User-Agent

    Corresponds to the HTTP User-Agent request header. This value is sent with all HTTP requests.

    Maximum Pages to Download

    Indicates the maximum number of pages the Web robot downloads before it terminates and saves what it has crawled.

    Maximum Crawl Duration

Indicates the maximum length of time the Web robot spends downloading before it terminates and saves what it has crawled. Because this amount of time may extend into days, you must specify it as an ISO 8601 duration string, for example, P2D for two days or PT12H for 12 hours.

    Maximum Consecutive Failures

Indicates the maximum number of consecutive failures the Web robot is allowed before it terminates and saves what it has crawled.

    Courtesy Timeout

    Indicates the length of time, in seconds, the Web robot waits between successful HTTP requests.

    Error Timeout

Indicates the length of time, in seconds, the Web robot waits between unsuccessful HTTP requests. Typically, the error timeout is slightly longer than the courtesy timeout, which gives the network and the target Web server time to recover before the next attempt. A sketch showing how these limits typically interact follows the User Agent page fields below.

    Maximum Page Tries

    Indicates the maximum number of times the Web robot attempts to download any Web page. Set to a higher value to enable robots to overcome temporary network or Web server failures.

    Connect Timeout

    Indicates the maximum length of time, in seconds, the Web robot waits to connect to the target Web server.

    Read Timeout

    Indicates the maximum length of time, in seconds, the Web robot waits on a connection to receive a response.

    Ignore Robots.txt

Indicates whether the Web robot ignores the site’s robots.txt file. The robots.txt file contains instructions that prevent Web robots from crawling and indexing certain files and directories on the site.
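    The limits on this page work together: the robot pauses between requests, retries failed pages a limited number of times, and stops when a page or failure budget is exhausted. The following Java sketch shows how such limits typically interact; it is not the Sybase Search crawl loop, and the fetchPage method and work queue are hypothetical stand-ins.

        import java.util.ArrayDeque;
        import java.util.Queue;

        public class CrawlLoopSketch {
            // Hypothetical stand-in for an HTTP fetch; returns true on success.
            static boolean fetchPage(String url) { return true; }

            public static void main(String[] args) throws InterruptedException {
                int maxPages = 1000;               // Maximum Pages to Download
                int maxTriesPerPage = 3;           // Maximum Page Tries
                int maxConsecutiveFailures = 10;   // Maximum Consecutive Failures
                long courtesyTimeoutMs = 2000;     // Courtesy Timeout
                long errorTimeoutMs = 5000;        // Error Timeout

                Queue<String> workQueue = new ArrayDeque<String>();
                workQueue.add("http://example.net/");

                int downloaded = 0;
                int consecutiveFailures = 0;
                while (!workQueue.isEmpty()
                        && downloaded < maxPages
                        && consecutiveFailures < maxConsecutiveFailures) {
                    String url = workQueue.poll();
                    boolean ok = false;
                    for (int attempt = 1; attempt <= maxTriesPerPage && !ok; attempt++) {
                        ok = fetchPage(url);
                        // Wait longer after a failure so the network and server can recover.
                        Thread.sleep(ok ? courtesyTimeoutMs : errorTimeoutMs);
                    }
                    if (ok) {
                        downloaded++;
                        consecutiveFailures = 0;   // a success resets the failure count
                    } else {
                        consecutiveFailures++;
                    }
                }
                // At this point the robot terminates and saves what it has crawled.
            }
        }

    Whichever limit is reached first, the robot terminates and saves what it has crawled, as described above.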

    Authentication page

    HTTP Authentication

    URL (prefix)

Indicates the prefix of the URLs that require authentication, for example, http://example.net/protected/. A sketch showing how the authentication settings typically map to HTTP requests follows the Authentication page fields below.

    Realm

    Indicates the name of the realm, if applicable. A realm is a database of user names and passwords that identify valid users of a Web application.

    Username

    Indicates the user name required for authentication.

    Password

    Indicates the password required for authentication.

    Confirm Password

    Reenter the password for confirmation.

    Form Authentication

    Action

    The URL that performs the authentication. This is the URL to which the HTML form is submitted.

    Method

    Indicates the request method, either GET or POST.

    User name Form Field

    Field Name

Indicates the form input field that represents the user name, for example, username, uname, or usr.

    Field Value

    Indicates the user name value, for example, jsmith.

    Password Form Field

    Field Name

Indicates the form input field that represents the password, for example, password, passwd, or pwd.

    Field Value

    Indicates the password value.

    Confirm Password

    Reenter the password for confirmation.
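    The HTTP Authentication and Form Authentication settings correspond to two common login mechanisms. The following sketch is an assumption-heavy illustration rather than the product’s implementation: it sends a Basic authentication request for a URL under the configured prefix, then submits a form login to an Action URL using configured field names and values. All URLs, field names, and credentials are placeholders.

        import java.io.OutputStream;
        import java.net.HttpURLConnection;
        import java.net.URL;
        import java.net.URLEncoder;
        import java.util.Base64;

        public class AuthSketch {
            public static void main(String[] args) throws Exception {
                // HTTP (Basic) authentication for a URL under the configured prefix.
                HttpURLConnection basic = (HttpURLConnection)
                        new URL("http://example.net/protected/index.html").openConnection();
                String credentials = "jsmith:secret";   // Username and Password
                String encoded = Base64.getEncoder()
                        .encodeToString(credentials.getBytes("UTF-8"));
                basic.setRequestProperty("Authorization", "Basic " + encoded);
                System.out.println("Basic auth status: " + basic.getResponseCode());

                // Form authentication: POST the configured field names and values
                // to the Action URL.
                HttpURLConnection form = (HttpURLConnection)
                        new URL("http://example.net/login").openConnection();   // Action
                form.setRequestMethod("POST");                                  // Method
                form.setDoOutput(true);
                form.setRequestProperty("Content-Type",
                        "application/x-www-form-urlencoded");
                String body = URLEncoder.encode("username", "UTF-8") + "="
                        + URLEncoder.encode("jsmith", "UTF-8") + "&"
                        + URLEncoder.encode("password", "UTF-8") + "="
                        + URLEncoder.encode("secret", "UTF-8");
                OutputStream out = form.getOutputStream();
                out.write(body.getBytes("UTF-8"));
                out.close();
                System.out.println("Form login status: " + form.getResponseCode());
            }
        }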

    Misc. page

    Default Page Names

Indicates the page names that the Web robot matches against the target Web server’s welcome file list, for example, index.html or index.jsp.

This enables the Web robot to recognize the following two URLs as the same page and to index only one version (a sketch of this normalization follows these steps):

    • http://example.net/

    • http://example.net/index.html

  5. Click Create to create the Web robot.
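A directory URL and the same directory followed by one of its welcome files usually return identical content, which is why the Default Page Names setting matters. The following sketch is only an assumed illustration of such normalization, not the product’s logic: it strips a configured default page name from the end of a URL so that both forms map to a single entry.

    public class DefaultPageNames {
        // Strips a trailing default page name so equivalent URLs compare equal.
        static String normalize(String url, String[] defaultPages) {
            for (String page : defaultPages) {
                if (url.endsWith("/" + page)) {
                    return url.substring(0, url.length() - page.length());
                }
            }
            return url;
        }

        public static void main(String[] args) {
            String[] defaults = { "index.html", "index.jsp" };
            // Both URLs normalize to http://example.net/ and are indexed once.
            System.out.println(normalize("http://example.net/index.html", defaults));
            System.out.println(normalize("http://example.net/", defaults));
        }
    }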

Steps: Editing a Web robot

  1. Click Document Management.

  2. Click Web Robots.

  3. Select the Web robot you want to edit.

  4. Click Edit. You can change the information in all the fields except Web Robot Manager and Passive Document Store Manager.

  5. Click Update when you have finished making changes.

Steps: Removing a Web robot

  1. Click Document Management.

  2. Click Web Robots.

  3. Select the Web robot you want to remove.

  4. Click Remove, then confirm that you want to remove the selected Web robot.

  5. Click OK. Sybase Search removes the Web robot and the associated passive document store.