http://www.httprecipes.com/index.php
http://www.httprecipes.com/1/index.php
The two URLs above specify different Internet resources; however, they are both on the same host, www.httprecipes.com.
The following two URLs are on different hosts:
http://www.httprecipes.com/index.php
http://www.heatonresearch.com/index.php
Even though both of these URLs access the file index.php, they are on different hosts. The first URL is on the www.httprecipes.com host, and the second is on the www.heatonresearch.com host.
If any part of the host is different, the two hosts are considered to be different. For example, www1.heatonresearch.com and www2.heatonresearch.com are two different hosts.
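The host comparison described above can be sketched in a few lines of Java. This is only an illustration, not the spider's own code: the class name HostCompare and its methods are hypothetical, and lower-casing the host before comparing is an added assumption based on host names being case-insensitive.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of the Heaton Research Spider.
final class HostCompare {
    /** Returns the lower-cased host part of a URL, or null if it has none. */
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        // Assumption: host names are compared case-insensitively.
        return host == null ? null : host.toLowerCase(Locale.ROOT);
    }

    /** Two URLs are on the same host only if the entire host name matches. */
    static boolean sameHost(String a, String b) {
        String hostA = hostOf(a);
        return hostA != null && hostA.equals(hostOf(b));
    }

    public static void main(String[] args) {
        // Same host, different resources: prints true
        System.out.println(sameHost("http://www.httprecipes.com/index.php",
                                    "http://www.httprecipes.com/1/index.php"));
        // Same file name, different hosts: prints false
        System.out.println(sameHost("http://www.httprecipes.com/index.php",
                                    "http://www.heatonresearch.com/index.php"));
        // Only the first label differs, so still different hosts: prints false
        System.out.println(sameHost("http://www1.heatonresearch.com/",
                                    "http://www2.heatonresearch.com/"));
    }
}
```

Comparing the whole host string, rather than just the domain, is what makes www1.heatonresearch.com and www2.heatonresearch.com count as separate hosts.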
To support multiple hosts, the SQL workload manager uses the SPIDER_HOST table. The SPIDER_HOST table keeps a list of all of the hosts that a spider has encountered.
The Heaton Research Spider processes only one host at a time. Once all the URLs from one host have been processed, the Heaton Research Spider moves on to the next host. Future versions of the Heaton Research Spider may add an option to mix hosts, but for now, it is one host at a time. Processing one host at a time makes it easier to work with robots.txt files, which site owners use to keep spiders out of portions of their site. You will learn more about robots.txt files in Chapter 16, “Well Behaved Bots.”
The workload manager uses two variables to work with multiple hosts. The currentHost variable tracks the String value of the current host. The currentHostID variable tracks the table ID of the current host.
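To make the roles of these two variables concrete, here is a minimal in-memory sketch. It is not the SQL workload manager itself: the HostTracker class and its addHost and nextHost methods are hypothetical, and a map stands in for the SPIDER_HOST table. Only the names currentHost and currentHostID come from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the host handling of a workload manager.
final class HostTracker {
    // Stand-in for the SPIDER_HOST table: host name -> table ID, in encounter order.
    private final Map<String, Integer> hostIds = new LinkedHashMap<>();
    private int nextId = 1;

    private String currentHost;     // String value of the current host
    private int currentHostID = 0;  // table ID of the current host (0 = none yet)

    /** Records a host if it has not been seen before and returns its ID. */
    int addHost(String host) {
        Integer id = hostIds.get(host);
        if (id == null) {
            id = nextId++;
            hostIds.put(host, id);
        }
        return id;
    }

    /** Moves to the next recorded host; returns false if none remain. */
    boolean nextHost() {
        for (Map.Entry<String, Integer> e : hostIds.entrySet()) {
            if (e.getValue() > currentHostID) {
                currentHost = e.getKey();
                currentHostID = e.getValue();
                return true;
            }
        }
        return false;
    }

    String getCurrentHost() { return currentHost; }

    int getCurrentHostID() { return currentHostID; }

    public static void main(String[] args) {
        HostTracker tracker = new HostTracker();
        tracker.addHost("www.httprecipes.com");
        tracker.addHost("www.heatonresearch.com");
        // One host at a time: finish the current host before moving on.
        while (tracker.nextHost()) {
            System.out.println(tracker.getCurrentHostID() + " " + tracker.getCurrentHost());
        }
    }
}
```

Keeping both the name and the ID means the name is available for comparing against newly found URLs, while the ID can be used as the key for rows that belong to that host.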
Determining Column Sizes
The SQL workload manager scans String columns to determine their size. This allows URLs and host names that are too long to be discarded. It is suggested that you always use a column size of 2,083 for URLs and a column size of 255 for host names. Host names can be up to 255 characters long, though they are rarely that long. However, these are only suggestions. You can set the size of these fields to any length, and the spider will adapt, truncating as needed.
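The size check can be illustrated with a small sketch. It assumes the column sizes have already been determined and stored in a map; with JDBC they could, for example, be read from the COLUMN_SIZE field returned by DatabaseMetaData.getColumns, though how the spider actually obtains them is not shown here. The ColumnSizes class and the column names URL and HOST are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that remembers the size of each String column.
final class ColumnSizes {
    private final Map<String, Integer> sizes = new HashMap<>();

    /** Records the maximum length of a column. */
    void setSize(String column, int size) {
        sizes.put(column, size);
    }

    /** Returns true if the value fits in the column (unknown columns always fit). */
    boolean fits(String column, String value) {
        Integer max = sizes.get(column);
        return max == null || value.length() <= max;
    }

    /** Cuts the value down to the column size if it is too long. */
    String truncate(String column, String value) {
        return fits(column, value) ? value : value.substring(0, sizes.get(column));
    }

    public static void main(String[] args) {
        ColumnSizes columns = new ColumnSizes();
        columns.setSize("URL", 2083);  // suggested size for URLs
        columns.setSize("HOST", 255);  // suggested size for host names

        String url = "http://www.httprecipes.com/index.php";
        if (columns.fits("URL", url)) {
            System.out.println("Accept: " + url);
        } else {
            System.out.println("Too long for the URL column, skipping");
        }
    }
}
```

Whether an oversized value is skipped or truncated is left to the caller in this sketch; both options are provided because the text mentions both.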