INSIDE THE HEATON RESEARCH SPIDER - HTTP Programming Recipes for Java Bots

Clearing the Workload

Clearing the workload is easy. The workload and waiting variables are cleared, and workingCount is set to zero.

this.workload.clear();
this.waiting.clear();
this.workingCount = 0;

The resume method is not implemented because the memory workload does not persist its data between runs of the program. There will be nothing to resume.
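The clearing step can be tried out in isolation. The following is a minimal sketch, not the spider's actual class: MemoryWorkloadSketch, its String keys, the Integer depth values, the ArrayDeque used for waiting, and the counter accessors are all assumptions made for illustration. Throwing UnsupportedOperationException from resume is likewise just one possible way to signal that nothing can be resumed.

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

// Hypothetical stand-in for an in-memory workload manager. The field names
// follow the text; the field types are assumptions for this sketch.
class MemoryWorkloadSketch {
    private final Map<String, Integer> workload = new HashMap<>(); // URL -> depth
    private final Queue<String> waiting = new ArrayDeque<>();      // URLs not yet handed out
    private int workingCount = 0;                                  // URLs currently in progress

    // Record a URL once and queue it for processing.
    void add(String url, int depth) {
        if (workload.putIfAbsent(url, depth) == null) {
            waiting.add(url);
        }
    }

    // Hand out the next waiting URL, counting it as in progress.
    String getWork() {
        String url = waiting.poll();
        if (url != null) {
            workingCount++;
        }
        return url;
    }

    // The three steps described in the text.
    void clear() {
        this.workload.clear();
        this.waiting.clear();
        this.workingCount = 0;
    }

    // Assumption: signal that an in-memory workload has nothing to resume.
    void resume() {
        throw new UnsupportedOperationException("in-memory workload cannot be resumed");
    }

    int knownUrls()   { return workload.size(); }
    int waitingUrls() { return waiting.size(); }
    int workingUrls() { return workingCount; }
}

class ClearWorkloadDemo {
    public static void main(String[] args) {
        MemoryWorkloadSketch w = new MemoryWorkloadSketch();
        w.add("http://www.example.com/", 1);
        w.add("http://www.example.com/about", 2);
        w.getWork(); // one URL is now in progress
        System.out.println(w.knownUrls() + " " + w.waitingUrls() + " " + w.workingUrls());
        w.clear();
        System.out.println(w.knownUrls() + " " + w.waitingUrls() + " " + w.workingUrls());
    }
}
```

After clear returns, all three counters read zero, so the next run starts from an empty workload.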

Getting the Depth of a URL

An important aspect of workload management is tracking the depth of each URL encountered. The depth of the URL was stored when the URL was added to the workload. To determine the depth of a URL, its URLStatus is read from the map.

URLStatus s = this.workload.get(url);
assert (s != null);
if (s != null) {
    return s.getDepth();
} else {
    return 1;
}

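An assert is used to ensure that the URL is found. If the spider is seeking the depth of a URL that has not been added yet, that is an error. Because Java only checks assertions when they are enabled (for example, with the -ea JVM option), the method otherwise falls back to returning a depth of 1.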
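The depth lookup can be sketched as a small, self-contained class. This is only an illustration: DepthEntry and DepthLookupSketch are hypothetical names, the map is keyed by String rather than by a URL object, and the status record is trimmed down to the depth alone.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, trimmed-down status record that stores only the depth.
class DepthEntry {
    private final int depth;

    DepthEntry(int depth) {
        this.depth = depth;
    }

    int getDepth() {
        return depth;
    }
}

// Hypothetical workload holder; String keys are an assumption of this sketch.
class DepthLookupSketch {
    private final Map<String, DepthEntry> workload = new HashMap<>();

    // Store the depth at the moment the URL is first added.
    void add(String url, int depth) {
        workload.putIfAbsent(url, new DepthEntry(depth));
    }

    // Same shape as the lookup in the text: assert, then fall back to 1.
    int getDepth(String url) {
        DepthEntry s = this.workload.get(url);
        assert s != null : "depth requested for unknown URL: " + url;
        if (s != null) {
            return s.getDepth();
        } else {
            return 1;
        }
    }
}

class DepthLookupDemo {
    public static void main(String[] args) {
        DepthLookupSketch w = new DepthLookupSketch();
        w.add("http://www.example.com/", 1);
        w.add("http://www.example.com/about", 2);
        System.out.println(w.getDepth("http://www.example.com/about")); // prints 2
    }
}
```

Asking this sketch for a URL that was never added raises an AssertionError when assertions are enabled and returns 1 when they are not, so the demo only looks up URLs it has added.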

Getting the Source of a URL

Along with the depth of a URL, the source of a URL is also tracked. The source of a URL is the page on which the URL was found. Finding the source of a URL is very similar to finding the depth of a URL. It is read from the URLStatus entry in the map.

URLStatus s = this.workload.get(url);
if (s == null) {
    return null;
} else {
    return s.getSource();
}

If a URL status is not found for the specified URL, then null is returned.
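Because the source lookup can return null, callers have to check the result before using it. The sketch below shows one possible use of that behavior: following sources backward to reconstruct how a page was reached. SourceEntry, SourceLookupSketch, and the pathTo helper are hypothetical names, the String keys are an assumption, and the sketch further assumes that the starting page is stored with a null source.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical status record holding the page a URL was found on.
class SourceEntry {
    private final String source;

    SourceEntry(String source) {
        this.source = source;
    }

    String getSource() {
        return source;
    }
}

// Hypothetical workload holder; String keys are an assumption of this sketch.
class SourceLookupSketch {
    private final Map<String, SourceEntry> workload = new HashMap<>();

    // Assumption: the starting page is added with a null source.
    void add(String url, String source) {
        workload.putIfAbsent(url, new SourceEntry(source));
    }

    // Same shape as the lookup in the text: null when the URL is unknown.
    String getSource(String url) {
        SourceEntry s = this.workload.get(url);
        if (s == null) {
            return null;
        } else {
            return s.getSource();
        }
    }

    // Hypothetical helper: walk the sources back toward the starting page.
    // The contains check stops the loop if the entries ever form a cycle.
    List<String> pathTo(String url) {
        List<String> path = new ArrayList<>();
        String current = url;
        while (current != null && !path.contains(current)) {
            path.add(0, current); // prepend so the starting page comes first
            current = getSource(current);
        }
        return path;
    }
}

class SourceLookupDemo {
    public static void main(String[] args) {
        SourceLookupSketch w = new SourceLookupSketch();
        w.add("http://www.example.com/", null);
        w.add("http://www.example.com/a", "http://www.example.com/");
        w.add("http://www.example.com/a/b", "http://www.example.com/a");
        System.out.println(w.pathTo("http://www.example.com/a/b"));
        System.out.println(w.getSource("http://www.example.com/missing")); // prints null
    }
}
```

Returning null keeps the lookup simple, at the cost of pushing the null check onto every caller.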
