Setting up the Database
If you are going to use the SQLWorkloadManager, you must first prepare a
database for it to use. The SQLWorkloadManager requires two tables to be
present in the database. They are listed here:
• spider_host
• spider_workload
There can be additional tables; however, the spider will simply ignore them.
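Before starting the spider, it can be useful to confirm that both required tables are present. The following is a minimal sketch of such a check using standard JDBC metadata calls. The class SpiderTableCheck and its methods are hypothetical helpers written for this illustration; they are not part of the spider or of SQLWorkloadManager.

```java
import java.sql.Connection;
import java.sql.DatabaseMetaData;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper for illustration; not part of the spider's API.
final class SpiderTableCheck {
    // The two tables the text says must exist.
    static final List<String> REQUIRED =
        Arrays.asList("spider_host", "spider_workload");

    // Returns the required tables that are absent from the given set.
    // Any additional tables in the set are ignored.
    static List<String> missingTables(Set<String> present) {
        Set<String> lower = new HashSet<>();
        for (String name : present) {
            lower.add(name.toLowerCase(Locale.ROOT));
        }
        List<String> missing = new ArrayList<>();
        for (String required : REQUIRED) {
            if (!lower.contains(required)) {
                missing.add(required);
            }
        }
        return missing;
    }

    // Reads the table names visible through an open JDBC connection.
    static Set<String> listTables(Connection conn) throws SQLException {
        Set<String> names = new HashSet<>();
        DatabaseMetaData meta = conn.getMetaData();
        try (ResultSet rs = meta.getTables(null, null, "%", new String[] {"TABLE"})) {
            while (rs.next()) {
                names.add(rs.getString("TABLE_NAME"));
            }
        }
        return names;
    }
}
```

With an open connection, SpiderTableCheck.missingTables(SpiderTableCheck.listTables(conn)) returns an empty list when both tables exist. Table names are compared case-insensitively here because databases differ in how they report name case; that choice is an assumption of this sketch.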
The spider_host table keeps a list of hosts that the spider has encountered. The
fields contained in the spider_host table are summarized in Table 13.2.
Table 13.2: The spider_host Table
Field Name    SQL Type        Purpose
host_id       int(10)         The primary key for the table.
host          varchar(255)    The host name (e.g. www.httprecipes.com).
status        varchar(1)      The status of the host.
urls_done     int(11)         The number of URLs successfully processed for this host.
urls_error    int(11)         The number of URLs that resulted in an error for this host.
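The fields in Table 13.2 could be turned into a table definition along the following lines. This is only a sketch: the column names and types come from the table, but the MySQL-style syntax is an assumption, and the class SpiderHostSchema is a hypothetical name used for illustration. Constraints the table does not mention (such as NOT NULL or auto-increment) are deliberately left out.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper for illustration; not part of the spider's API.
final class SpiderHostSchema {
    // Column names and types follow Table 13.2. The statement assumes a
    // database that accepts MySQL-style types such as int(10).
    static final String CREATE_SPIDER_HOST =
        "CREATE TABLE spider_host ("
        + "host_id int(10), "
        + "host varchar(255), "
        + "status varchar(1), "
        + "urls_done int(11), "
        + "urls_error int(11), "
        + "PRIMARY KEY (host_id))";

    // Runs the CREATE TABLE statement on an open JDBC connection.
    static void create(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(CREATE_SPIDER_HOST);
        }
    }
}
```

Calling SpiderHostSchema.create(conn) once against an empty database would create the table. Other databases may need the types adjusted.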
The spider_workload table contains a complete list of every URL that the spider
has encountered. The fields contained in the spider_workload table are summarized
in Table 13.3.
Table 13.3: The spider_workload Table
Field Name     SQL Type         Purpose
workload_id    int(10)          The primary key for this table.
host           int(10)          The host ID that this URL corresponds to.
url            varchar(2083)    The URL used for this workload element.
status         varchar(1)       The status of this workload element.
depth          int(10)          The depth of this URL.
url_hash       int(11)          A hash code that allows the URL to be quickly looked up.
source_id      int(11)          The ID of the URL where this URL was found.
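The url_hash field exists so that a URL can be found without comparing the long url column against every row. A minimal sketch of such a lookup follows. It assumes the hash is Java's String.hashCode(), which the text does not confirm; the class WorkloadLookup and its members are hypothetical names used for illustration.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

// Hypothetical helper for illustration; not part of the spider's API.
final class WorkloadLookup {
    // Narrow the search by url_hash first, then compare the full URL,
    // because two different URLs can share the same hash code.
    static final String FIND_BY_URL =
        "SELECT workload_id FROM spider_workload WHERE url_hash = ? AND url = ?";

    // Assumption: the stored hash is String.hashCode(). A Java int is a
    // signed 32-bit value, which fits a signed 32-bit SQL integer column.
    static int urlHash(String url) {
        return url.hashCode();
    }

    // Returns the workload_id for the URL, or -1 if the URL is not stored.
    static int findWorkloadId(Connection conn, String url) throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(FIND_BY_URL)) {
            ps.setInt(1, urlHash(url));
            ps.setString(2, url);
            try (ResultSet rs = ps.executeQuery()) {
                return rs.next() ? rs.getInt(1) : -1;
            }
        }
    }
}
```

The lookup is only fast if the database has an index on url_hash; whether the spider's schema defines one is not stated here.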