Adding a URL
When you first create a Spider object, you must add one or more URLs for it to begin
processing. If you do not add a URL, the spider will have no work. These URLs are added
through the spider's addURL method. When the spider finds other URLs, it also uses its own
addURL method to add them. This is helpful because addURL performs checks on each URL to
make sure it should be added to the workload.
First, the spider checks to see if the URL being added is beyond the specified maximum
depth.
    if ((this.options.maxDepth != -1) &&
            (depth > this.options.maxDepth)) {
        return;
    }
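Read on its own, the condition above treats a maxDepth of -1 as "no limit" and rejects a URL only when its depth is strictly greater than the maximum. The sketch below restates that test as a stand-alone helper so the boundary cases are easy to see. The names DepthCheck and tooDeep are invented for this illustration and are not part of the spider.

```java
// Hypothetical helper that restates the depth test from addURL.
class DepthCheck {
    // Returns true when a URL found at this depth should be rejected.
    // A maxDepth of -1 disables the limit, as in the condition above.
    static boolean tooDeep(int maxDepth, int depth) {
        return (maxDepth != -1) && (depth > maxDepth);
    }
}
```

Under this reading, a URL whose depth equals maxDepth is still accepted; only deeper URLs are dropped.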
Next, the spider makes sure that no filter has excluded the URL.
    for (SpiderFilter filter : this.filters) {
        if (filter.isExcluded(url)) {
            return;
        }
    }
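To make the filter check more concrete, here is a minimal sketch of what a filter might look like. It assumes a stand-in SpiderFilter interface that declares only the isExcluded method seen in the loop above; the real interface may declare more. HostFilter, FilterDemo, passesFilters, and the url helper are hypothetical names used only for this example.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.List;

// Stand-in interface: only isExcluded is shown in the text.
interface SpiderFilter {
    boolean isExcluded(URL url);
}

// Hypothetical filter that excludes every URL outside one allowed host.
class HostFilter implements SpiderFilter {
    private final String allowedHost;

    HostFilter(String allowedHost) {
        this.allowedHost = allowedHost;
    }

    @Override
    public boolean isExcluded(URL url) {
        return !allowedHost.equalsIgnoreCase(url.getHost());
    }
}

class FilterDemo {
    // Same logic as the loop above: a URL passes only if no filter excludes it.
    static boolean passesFilters(URL url, List<SpiderFilter> filters) {
        for (SpiderFilter filter : filters) {
            if (filter.isExcluded(url)) {
                return false;
            }
        }
        return true;
    }

    // Convenience wrapper that turns the checked exception into an unchecked one.
    static URL url(String spec) {
        try {
            return new URL(spec);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(spec, e);
        }
    }
}
```

With a single HostFilter for www.example.com, a URL on that host passes, while a URL on any other host is excluded before it reaches the workload manager.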
Finally, the URL is passed on to the workload manager. If the workload manager returns true,
then the URL was added. The workload manager also performs additional filtering on URLs.
Specifically, if the workload manager determines that the URL has already been found, then
the URL is not reprocessed.
    // Add the item.
    if (this.workloadManager.add(url, source, depth)) {
        StringBuilder str = new StringBuilder();
        str.append("Adding to workload: ");
        str.append(url);
        str.append("(depth=");
        str.append(depth);
        str.append(")");
        logger.fine(str.toString());
    }
If the URL was added, the spider logs it at the FINE level.
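The duplicate check performed by the workload manager can be pictured with a small in-memory stand-in. The sketch below is not the spider's workload manager: SimpleWorkload is a hypothetical class, and it uses plain strings for URLs to keep the example short. It only illustrates the contract described above, where add returns true the first time a URL is offered and false afterwards.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical in-memory workload; illustrates the add contract only.
class SimpleWorkload {
    // Maps each known URL to the depth at which it was first found.
    private final Map<String, Integer> depthByUrl = new LinkedHashMap<>();

    // Returns true if the URL was new and has been recorded,
    // false if it had already been found. The source URL is accepted
    // to mirror the call in the text but is not stored in this sketch.
    boolean add(String url, String source, int depth) {
        return depthByUrl.putIfAbsent(url, depth) == null;
    }

    int size() {
        return depthByUrl.size();
    }
}
```

Because a repeated URL returns false, the caller's logging branch runs only once per URL, which matches the behavior described for the spider.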
Processing All Hosts
When the process method is called, the spider begins working. The process method will not
return until the spider has no more work to do. The process method begins by clearing the
cancel flag and recording the starting time for the spider.
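The opening steps of process can be sketched as follows. This is a simplified illustration, not the spider's code: ProcessSketch, its string work queue, and the processed counter are all assumptions made for the example, and the idea that the loop also stops when the cancel flag is set is a guess about how the flag is used.

```java
import java.util.ArrayDeque;
import java.util.Date;
import java.util.Queue;

// Hypothetical, single-threaded stand-in for the spider's process method.
class ProcessSketch {
    private volatile boolean cancel;
    private Date startTime;
    private final Queue<String> work = new ArrayDeque<>();
    private int processed;

    void addWork(String url) {
        work.add(url);
    }

    // Requests that a running process call stop early (assumed usage).
    void cancel() {
        this.cancel = true;
    }

    void process() {
        // Clear any earlier cancel request and record the start time.
        this.cancel = false;
        this.startTime = new Date();

        // Keep going until there is no more work, or until cancelled.
        while (!this.cancel && !this.work.isEmpty()) {
            this.work.poll();   // a real spider would download and parse here
            this.processed++;
        }
    }

    int getProcessed() {
        return processed;
    }

    Date getStartTime() {
        return startTime;
    }
}
```

Because the flag is cleared at the top of process, a cancel request left over from an earlier run does not prevent a new run from starting.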