http://www.httprecipes.com/index.php
http://www.httprecipes.com/1/index.php
The two URLs above specify different Internet resources; however, both are on
the same host, www.httprecipes.com.
The following two URLs are on different hosts:
http://www.httprecipes.com/index.php
http://www.heatonresearch.com/index.php
Even though both of these URLs access the file index.php, they are on different
hosts. The first URL is on the www.httprecipes.com host, and the second is on
the www.heatonresearch.com host.
If any part of the host is different, the two hosts are considered to be different. For
example, www1.heatonresearch.com and www2.heatonresearch.com
are two different hosts.
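The host comparison described above can be sketched in a few lines of Java. This is only an illustration, not code from the Heaton Research Spider: the class name HostCompare and its methods are hypothetical, and the sketch assumes the standard java.net.URI class is used to extract the host.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper, not part of the Heaton Research Spider.
final class HostCompare {
    // Extract the host part of a URL, lowercased because host names
    // are not case sensitive.
    static String hostOf(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        return host.toLowerCase(Locale.ROOT);
    }

    // Two URLs are on the same host only if the entire host name matches.
    static boolean sameHost(String a, String b) {
        return hostOf(a).equals(hostOf(b));
    }

    public static void main(String[] args) {
        System.out.println(sameHost(
            "http://www.httprecipes.com/index.php",
            "http://www.httprecipes.com/1/index.php"));
        System.out.println(sameHost(
            "http://www1.heatonresearch.com/",
            "http://www2.heatonresearch.com/"));
    }
}
```

Because the whole host string is compared, www1.heatonresearch.com and www2.heatonresearch.com are treated as different hosts, as are www.httprecipes.com and www.heatonresearch.com.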
To support multiple hosts, the SQL workload manager uses the SPIDER_HOST table.
The SPIDER_HOST table keeps a list of all of the hosts that a spider has encountered.
The Heaton Research Spider only processes one host at a time. Once all the URLs from
one host have been processed, the Heaton Research Spider moves on to the next host. Future
versions of the Heaton Research Spider may add an option to mix hosts, but for now, it is one
host at a time. Processing one host at a time makes it easier to work with robots.txt
files, which site owners use to keep spiders out of portions of their site. You will learn
more about robots.txt files in Chapter 16, “Well Behaved Bots”.
The workload manager uses two variables to work with multiple hosts. The currentHost
variable tracks the String value of the current host. The currentHostID variable tracks
the table ID of the current host.
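To make the one-host-at-a-time idea concrete, the following is a minimal in-memory sketch. It is not the SQL workload manager itself: the class HostQueueSketch, its methods, and the use of maps in place of the SPIDER_HOST table are all assumptions made for illustration. Only the variable names currentHost and currentHostID come from the text above.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical in-memory stand-in for the SQL workload manager.
final class HostQueueSketch {
    // Plays the role of SPIDER_HOST: host name -> ID, in order first seen.
    private final Map<String, Integer> hostIds = new LinkedHashMap<>();
    // URLs still waiting to be processed, grouped by host ID.
    private final Map<Integer, Deque<String>> waiting = new LinkedHashMap<>();

    private String currentHost;      // host name being processed
    private int currentHostID = -1;  // ID of that host, -1 if none yet

    void add(String url) {
        String host = URI.create(url).getHost();
        if (host == null) {
            throw new IllegalArgumentException("URL has no host: " + url);
        }
        host = host.toLowerCase(Locale.ROOT);
        Integer id = hostIds.get(host);
        if (id == null) {
            id = hostIds.size() + 1;   // assign the next free ID
            hostIds.put(host, id);
        }
        waiting.computeIfAbsent(id, k -> new ArrayDeque<>()).add(url);
        if (currentHost == null) {     // the first host seen becomes current
            currentHost = host;
            currentHostID = id;
        }
    }

    // Returns the next URL for the current host. When that host has no URLs
    // left, moves on to the next host with work; returns null when all done.
    String next() {
        Deque<String> queue = waiting.get(currentHostID);
        if (queue == null || queue.isEmpty()) {
            for (Map.Entry<String, Integer> e : hostIds.entrySet()) {
                Deque<String> q = waiting.get(e.getValue());
                if (q != null && !q.isEmpty()) {
                    currentHost = e.getKey();
                    currentHostID = e.getValue();
                    queue = q;
                    break;
                }
            }
        }
        return (queue == null) ? null : queue.poll();
    }

    String getCurrentHost() { return currentHost; }
    int getCurrentHostID() { return currentHostID; }
}
```

In this sketch, URLs from a second host can be added at any time, but none of them is returned by next() until every waiting URL from the current host has been handed out.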
Determining Column Sizes
The SQL workload manager scans String columns to determine their size. This
allows URLs and host names that are too long to be discarded. It is suggested that you always
use a column size of 2,083 for URLs and a column size of 255 for host names. Host names
can be up to 255 characters long, though they are rarely that long. However, these are only
suggestions. You can set these fields to any length, and the spider will adapt,
truncating values as needed.
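The following sketch shows one way a size check of this kind might look. It is an assumption-laden illustration rather than the workload manager's actual code: the class ColumnLimits, its methods, and the column names HOST and URL are hypothetical, and the sizes are set by hand here. With a real database, the sizes could instead be read from the COLUMN_SIZE value returned by the standard JDBC DatabaseMetaData.getColumns call.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper that remembers column sizes and applies them.
final class ColumnLimits {
    // "TABLE.COLUMN" -> maximum number of characters.
    private final Map<String, Integer> sizes = new HashMap<>();

    private static String key(String table, String column) {
        return table.toUpperCase() + "." + column.toUpperCase();
    }

    // Record the size of a String column (set by hand in this sketch).
    void setSize(String table, String column, int size) {
        sizes.put(key(table, column), size);
    }

    // True if the value fits, or if no size is known for the column.
    boolean fits(String table, String column, String value) {
        Integer max = sizes.get(key(table, column));
        return max == null || value.length() <= max;
    }

    // Cut the value down to the column size if it is too long.
    String truncate(String table, String column, String value) {
        Integer max = sizes.get(key(table, column));
        if (max == null || value.length() <= max) {
            return value;
        }
        return value.substring(0, max);
    }
}
```

A caller could use fits to decide whether to discard an overly long URL, or truncate to shorten a value before it is stored. Which of the two applies to a given column is a design choice left open in this sketch.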