Adding a URL
When you first create a Spider object, you must add one or more URLs to begin
processing. If you do not add a URL, the spider will have no work. These URLs are
added through the spider's addURL method. When the spider finds other URLs, it
calls the same addURL method to add them. This is helpful because addURL checks
each URL to make sure it should be added to the workload.
First, the spider checks to see if the URL being added is beyond the specified maximum
depth.
if ((this.options.maxDepth != -1) &&
    (depth > this.options.maxDepth)) {
  return;
}
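The depth check treats -1 as a sentinel meaning that no maximum depth is set. The following is a minimal sketch of that rule as a standalone helper. The names DepthCheck, UNLIMITED and exceedsMaxDepth are hypothetical and are not part of the spider.

```java
// Hypothetical helper that isolates the depth rule; not part of the spider's API.
final class DepthCheck {
    /** Sentinel meaning "no depth limit", as implied by the check above. */
    static final int UNLIMITED = -1;

    private DepthCheck() {
    }

    /** Returns true if a URL at the given depth should be rejected. */
    static boolean exceedsMaxDepth(int maxDepth, int depth) {
        return maxDepth != UNLIMITED && depth > maxDepth;
    }
}
```

Note that a URL whose depth equals the maximum is still accepted; only deeper URLs are rejected.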
Next, the spider makes sure that none of its filters excludes the URL.
for (SpiderFilter filter : this.filters) {
  if (filter.isExcluded(url)) {
    return;
  }
}
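To illustrate what a filter might look like, here is a minimal sketch that excludes URLs by file extension. It is an assumption-laden example: UrlFilter is a simplified stand-in for the spider's filter type (the real SpiderFilter may declare other methods), ExtensionFilter is a hypothetical class, and java.net.URI is used here for convenience rather than whatever URL type the spider passes.

```java
import java.net.URI;
import java.util.Locale;

// Simplified stand-in for the spider's filter type (assumption).
interface UrlFilter {
    boolean isExcluded(URI url);
}

// Hypothetical filter that excludes URLs whose path ends in a given extension.
final class ExtensionFilter implements UrlFilter {
    private final String suffix;

    ExtensionFilter(String extension) {
        this.suffix = "." + extension.toLowerCase(Locale.ROOT);
    }

    @Override
    public boolean isExcluded(URI url) {
        // Opaque URIs such as mailto: have no path, so they are not excluded here.
        String path = url.getPath();
        return path != null && path.toLowerCase(Locale.ROOT).endsWith(suffix);
    }
}
```

A spider holding a list of such filters would skip any URL for which at least one filter returns true, which is what the loop above does.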
Finally, the URL is passed on to the workload manager. If the workload manager returns
true, then the URL was added. The workload manager also performs additional filtering on
URLs. Specifically, if the workload manager determines that the URL has already been found,
then the URL is not reprocessed.
// Add the item.
if (this.workloadManager.add(url, source, depth)) {
  StringBuilder str = new StringBuilder();
  str.append("Adding to workload: ");
  str.append(url);
  str.append("(depth=");
  str.append(depth);
  str.append(")");
  logger.fine(str.toString());
}
If the URL was added, the spider then logs it.
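The duplicate check performed by the workload manager can be sketched with a set of URLs that have already been found. This is only an illustration, not the actual workload manager: InMemoryWorkload and waitingCount are hypothetical names, and java.net.URI is an assumed URL type.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical in-memory workload; not the spider's real workload manager.
final class InMemoryWorkload {
    private final Set<URI> seen = new HashSet<>();
    private final Deque<URI> waiting = new ArrayDeque<>();

    /**
     * Returns true only if the URL has not been found before and was queued.
     * The source and depth parameters mirror the call above but are not
     * stored in this sketch.
     */
    boolean add(URI url, URI source, int depth) {
        if (!seen.add(url)) {
            return false; // already found, so do not queue it again
        }
        waiting.addLast(url);
        return true;
    }

    /** Number of URLs still waiting to be processed. */
    int waitingCount() {
        return waiting.size();
    }
}
```

Because add returns false for a URL that was already found, the caller can log only the URLs that were actually queued.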
Processing All Hosts
When the process method is called, the spider begins working. The process
method will not return until the spider has no more work to do. The process method
begins by clearing the cancel flag and recording the starting time for the spider.
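The opening steps of such a method can be sketched as follows. This is a simplified, single-threaded illustration under stated assumptions: ProcessSketch and all of its members are hypothetical, the work queue holds plain strings, and the actual download and parsing are replaced by a counter.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical, single-threaded sketch of a blocking process method.
final class ProcessSketch {
    private volatile boolean cancel;
    private long startTimeMillis;
    private int processed;
    private final Queue<String> work = new ArrayDeque<>();

    void addWork(String url) {
        work.add(url);
    }

    /** Requests that a running process call stop after the current item. */
    void cancel() {
        cancel = true;
    }

    /** Does not return until the queue is empty or cancel is requested. */
    void process() {
        cancel = false;                               // clear the cancel flag
        startTimeMillis = System.currentTimeMillis(); // record the starting time
        while (!cancel && !work.isEmpty()) {
            work.remove();
            processed++; // placeholder for downloading and parsing the URL
        }
    }

    int processedCount() {
        return processed;
    }

    long startTimeMillis() {
        return startTimeMillis;
    }
}
```

Clearing the flag at the start means that a cancel request left over from an earlier run does not stop the new one.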