Creating, editing, and removing Web robots

Web robots create and manage their own passive document store. A Web robot crawls the specified Web sites and saves the textual content of each page locally. It does not save other Web content such as images, JavaScript, and style sheets.

Note: Depending on the Web robot configuration and the size of the Web site to index, a Web robot may take considerable time to crawl the target Web sites. Sybase recommends that you configure only one Web robot per Web site.

Steps: Creating a Web robot

  1. Click Document Management. The Document Stores Summary page appears.

  2. Click Web Robots. The Web Robots page appears.

  3. Click Import from the Web. The Create Web Robot page appears.

  4. Complete these fields:

    Main

    • Name – Name of the Web robot.

    • Crawl Now – Whether the Web robot begins crawling immediately or waits until it is scheduled or started manually.

    • Force Refresh – Whether the Web robot discards the previously collected URL data and starts a fresh crawl.

      When a Web robot crawls a Web site, it stores some of the HTTP response headers of each page it downloads, such as the status code and the Expires, Last-Modified, and ETag headers. This information helps determine whether a page needs to be downloaded again, which makes re-crawls more efficient (see the conditional request example after these steps).

      The Force Refresh check box is enabled when you edit the Web robot.

    • Web Robot Manager – The Web Robot Manager that hosts the Web robot.

    • Passive Document Store Manager – The Document Store Manager to which the Web robot sends its crawled documents for indexing.

    URLs

    • Start URLs – The URLs the Web robot visits first.

    • Link extractor patterns – Links are extracted from pages downloaded from URLs that match one of these patterns and are added to the URL (work) queue.

    • Regular expressions – Whether the patterns are treated as Java 1.5 regular expressions. A regular expression pattern follows a set of syntax rules to describe or match a set of strings. For more information, see the Java API Web site.

      If this check box is not selected, patterns are treated as non-regular expressions: patterns that begin with http:// or https:// are treated as “starts with” patterns, and all other patterns are treated as “contains string” patterns (see the pattern-matching example after these steps). For example:

      • http://www.mysite.net – extracts links from all pages.

      • http://www.mysite.net/public/ – extracts links only from pages in the /public directory.

      • /public/ – extracts all links that include “/public/” as part of their URL.

    • Link extractor pattern exceptions – Exceptions to the general rules specified in Link extractor patterns.

    • Index patterns – Pages downloaded from URLs that match one of these patterns are indexed.

    • Index pattern exceptions – Exceptions to the general rules specified in Index patterns.

    User Agent

    • User-Agent – Corresponds to the HTTP User-Agent request header. This value is sent with all HTTP requests.

    • Maximum pages to download – The maximum number of pages the Web robot downloads before terminating automatically and saving what it has crawled so far.

    • Maximum crawl duration – The maximum length of time the Web robot spends downloading before terminating automatically and saving what it has crawled so far. Because this time may extend into days, it must be specified as an ISO 8601 duration string, for example, P1DT12H for one day and 12 hours.

    • Maximum consecutive failures – The maximum number of consecutive failures the Web robot tolerates before terminating automatically and saving what it has crawled so far.

    • Courtesy timeout – The length of time, in seconds, the Web robot waits between successful HTTP requests.

    • Error timeout – The length of time, in seconds, the Web robot waits between unsuccessful HTTP requests. This is typically slightly longer than the courtesy timeout, to give the network and the target Web server time to recover before the next attempt.

    • Maximum page tries – The maximum number of times the Web robot attempts to download any one page. A higher value helps the Web robot overcome temporary network or Web server failures.

    • Connect timeout – The maximum length of time, in seconds, the Web robot waits to connect to the target Web server.

    • Read timeout – The maximum length of time, in seconds, the Web robot waits on an open connection to receive a response.

    Authentication

    HTTP Authentication

    • URL (prefix) – The prefix of the URLs that require authentication, for example, http://example.net/protected/

    • Realm – The name of the realm, if applicable.

    • Username – The username required for authentication.

    • Password – The password required for authentication.

    • Confirm password – Re-enter the password for confirmation.

    Form Authentication

    • Action – The URL that performs the authentication; that is, the URL to which the HTML form is submitted (see the form login example after these steps).

    • Method – The request method, either GET or POST.

    Username Form Field

    • Field name – The name of the form input field that represents the username, for example, username, uname, or usr.

    • Field value – The username value, for example, jsmith.

    Password Form Field

    • Field name – The name of the form input field that represents the password, for example, password, passwd, or pwd.

    • Field value – The password value.

    • Confirm password – Re-enter the password for confirmation.

    Misc.

    • Default page names – The page names that the Web robot expects to match the target Web server’s welcome file list, for example, index.html or index.jsp.

      This enables the Web robot to assume that the following URLs are equivalent and that only one version should be indexed:

      • http://example.net/

      • http://example.net/index.html

  5. Click Create. Sybase Search creates the Web robot and the Web Robot Information page appears.
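
The conditional request example mentioned under Force Refresh: the headers listed there (the status code, Expires, Last-Modified, and ETag) are standard HTTP features. The following sketch is not Sybase Search code; it only illustrates, using java.net.HttpURLConnection, how header values saved from a previous download can be replayed so that the server can answer “304 Not Modified” instead of resending an unchanged page. The URL and header values are made-up examples.

    import java.net.HttpURLConnection;
    import java.net.URL;

    // Minimal sketch of an HTTP conditional re-fetch (not Sybase Search code).
    // Header values saved from a previous download are replayed so the server
    // can answer "304 Not Modified" when the page is unchanged.
    public class ConditionalFetch {
        public static void main(String[] args) throws Exception {
            String savedLastModified = "Tue, 01 Apr 2008 10:00:00 GMT"; // stored Last-Modified value
            String savedETag = "\"abc123\"";                            // stored ETag value

            HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.net/index.html").openConnection();
            conn.setRequestProperty("If-Modified-Since", savedLastModified);
            conn.setRequestProperty("If-None-Match", savedETag);

            int status = conn.getResponseCode();
            if (status == HttpURLConnection.HTTP_NOT_MODIFIED) {   // 304: keep the indexed copy
                System.out.println("Page unchanged; no need to re-download.");
            } else {
                System.out.println("Page changed (HTTP " + status + "); re-download and re-index.");
            }
            conn.disconnect();
        }
    }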
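
The pattern-matching example mentioned under Regular expressions: the sketch below is not Sybase Search code; it simply restates the three matching behaviors described in the table, and it assumes that a regular expression pattern may match any part of the URL. The URLs and patterns follow the examples given above.

    import java.util.regex.Pattern;

    // Minimal sketch of the pattern semantics described above (not Sybase Search code).
    public class PatternCheck {

        static boolean matches(String url, String pattern, boolean regularExpressions) {
            if (regularExpressions) {
                // "Regular expressions" checked: treat the pattern as a Java regex.
                // Matching it against any part of the URL is an assumption of this sketch.
                return Pattern.compile(pattern).matcher(url).find();
            }
            if (pattern.startsWith("http://") || pattern.startsWith("https://")) {
                return url.startsWith(pattern);   // "starts with" pattern
            }
            return url.contains(pattern);         // "contains string" pattern
        }

        public static void main(String[] args) {
            String url = "http://www.mysite.net/public/news/today.html";

            System.out.println(matches(url, "http://www.mysite.net", false));          // true: starts with
            System.out.println(matches(url, "http://www.mysite.net/private/", false)); // false
            System.out.println(matches(url, "/public/", false));                       // true: contains
            System.out.println(matches(url, "\\.html$", true));                        // true: regex match
        }
    }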
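
The form login example mentioned under Form Authentication: the sketch below is not Sybase Search code; it shows how the Action, Method, and Field name/value settings correspond to an ordinary HTML form submission. The Action URL (http://example.net/login) and the password value are hypothetical; the field names and the username jsmith follow the examples in the table.

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;

    // Minimal sketch of a form login (not Sybase Search code). The Action URL and
    // the password value are hypothetical; the field names and the username follow
    // the examples given in the table above.
    public class FormLogin {
        public static void main(String[] args) throws Exception {
            String action = "http://example.net/login";                 // Action (hypothetical URL)

            // Body of an application/x-www-form-urlencoded POST:
            // <username field name>=<username value>&<password field name>=<password value>
            String body = URLEncoder.encode("username", "UTF-8") + "="
                        + URLEncoder.encode("jsmith", "UTF-8") + "&"
                        + URLEncoder.encode("password", "UTF-8") + "="
                        + URLEncoder.encode("secret", "UTF-8");         // hypothetical password value

            HttpURLConnection conn = (HttpURLConnection) new URL(action).openConnection();
            conn.setRequestMethod("POST");                              // Method
            conn.setDoOutput(true);
            conn.setRequestProperty("Content-Type", "application/x-www-form-urlencoded");

            OutputStream out = conn.getOutputStream();
            out.write(body.getBytes("UTF-8"));
            out.close();

            System.out.println("Login response: HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }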

Steps: Editing a Web robot

  1. Click Document Management. The Document Stores Summary page appears.

  2. Click Web Robots. The Web Robots page appears.

  3. Select the Web robot you want to edit. The Web Robot Information page appears, displaying the details of the selected Web robot.

  4. Click Edit. The Edit Web Robot page appears. You can change the information in all the fields except Web Robot Manager and Passive Document Store Manager.

  5. Make the required changes and click Update. Sybase Search saves the changes and returns you to the Web Robot Information page.

Steps: Removing a Web robot

  1. Click Document Management. The Document Stores Summary page appears.

  2. Click Web Robots. The Web Robots page appears.

  3. Select the Web robot you want to remove. The Web Robot Information page appears, displaying the details of the selected Web robot.

  4. Click Remove. You are prompted to confirm whether you want to remove the selected Web robot.

  5. Click OK. Sybase Search removes the Web robot and the associated passive document store and returns you to the Web Robots page.