INSIDE THE HEATON RESEARCH SPIDER - HTTP Programming Recipes for Java Bots

this.waiting.remove(url);
setStatus(url, null, URLStatus.Status.PROCESSED, -1);

Finally, the URL's status is set to PROCESSED.

Setting a URL Status

Both the markProcessed and markError methods rely on the setStatus method to actually set the status for a URL. The setStatus method accepts a URL, a page source, a status and a page depth. Setting the source and depth is optional. If you do not wish to affect the source, pass null for source. If you do not wish to affect the depth, pass negative one for depth.

The setStatus method begins by attempting to access the URLStatus object in the map for the specified URL. If no status object is found, then one is created and added to the map. The status is then set on that object.

URLStatus s = this.workload.get(url);
if (s == null) {
  s = new URLStatus();
  this.workload.put(url, s);
}
s.setStatus(status);

If a value was specified for source, then set the source for the URLStatus object.

if (source != null) {
  s.setSource(source);
}

If a value was specified for depth, then set the depth for the URLStatus object.

if (depth != -1) {
  s.setDepth(depth);
}

The workload manager uses this method internally any time the status is set.
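The fragments above can be assembled into one small program to see how the optional arguments behave. This is only a sketch under stated assumptions: the WorkloadSketch class, the stand-in URLStatus class, its getters, and the WAITING and ERROR enum values are hypothetical, since only PROCESSED appears in the text. The map is keyed by java.net.URI rather than URL so the example does not trigger the host lookups that URL comparison can perform.

```java
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

class WorkloadSketch {
    // Hypothetical stand-in for the spider's URLStatus class.
    static class URLStatus {
        // Only PROCESSED is named in the text; the other values are assumed.
        enum Status { WAITING, PROCESSED, ERROR }

        private Status status;
        private URI source;
        private int depth;

        Status getStatus() { return status; }
        void setStatus(Status status) { this.status = status; }
        URI getSource() { return source; }
        void setSource(URI source) { this.source = source; }
        int getDepth() { return depth; }
        void setDepth(int depth) { this.depth = depth; }
    }

    private final Map<URI, URLStatus> workload = new HashMap<>();

    // Sets the status; source and depth change only when real values are passed.
    void setStatus(URI url, URI source, URLStatus.Status status, int depth) {
        URLStatus s = this.workload.get(url);
        if (s == null) {
            s = new URLStatus();
            this.workload.put(url, s);
        }
        s.setStatus(status);
        if (source != null) {
            s.setSource(source);
        }
        if (depth != -1) {
            s.setDepth(depth);
        }
    }

    // Hypothetical accessor, added only so the result can be inspected.
    URLStatus get(URI url) {
        return this.workload.get(url);
    }

    public static void main(String[] args) {
        WorkloadSketch w = new WorkloadSketch();
        URI home = URI.create("http://www.example.com/");
        URI page = URI.create("http://www.example.com/page");

        // First call records source and depth along with the status.
        w.setStatus(page, home, URLStatus.Status.WAITING, 2);
        // Second call passes null and -1, so source and depth are left alone.
        w.setStatus(page, null, URLStatus.Status.PROCESSED, -1);

        URLStatus s = w.get(page);
        System.out.println(s.getStatus() + " " + s.getSource() + " " + s.getDepth());
    }
}
```

The second call mirrors the one made at the end of markProcessed: the status changes while the previously stored source and depth survive.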

Summary

In Chapter 13 you saw how to use the Heaton Research Spider. In this chapter you saw how the Heaton Research Spider was constructed. This chapter is intended for those who want to see the inner workings of the Heaton Research Spider rather than simply use it.

The spider uses a thread pool to work more efficiently. Besides letting the spider take advantage of multi-processor systems, the thread pool improves throughput even on a single-processor system. This is because the spider spends a great deal of time waiting on network responses. A thread pool allows the spider to wait on a large number of URLs at the same time.
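The effect of waiting on many URLs at once can be illustrated with the standard java.util.concurrent executor. This is a minimal sketch, not the spider's own pool code: the PoolSketch class, the fetchAll method, and the simulatedFetch method are hypothetical, and the download is replaced by a short sleep so the example runs without a network.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class PoolSketch {
    // Hypothetical stand-in for a download: the thread only waits,
    // much as it would while a remote server prepares its response.
    static String simulatedFetch(String url) throws InterruptedException {
        Thread.sleep(200);
        return "contents of " + url;
    }

    // Submits one task per URL to a fixed-size pool and collects the results
    // in the same order as the input list.
    static List<String> fetchAll(List<String> urls, int threads) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Callable<String>> tasks = new ArrayList<>();
            for (String url : urls) {
                tasks.add(() -> simulatedFetch(url));
            }
            List<String> results = new ArrayList<>();
            for (Future<String> future : pool.invokeAll(tasks)) {
                results.add(future.get());
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < 8; i++) {
            urls.add("http://www.example.com/page" + i);
        }
        long start = System.currentTimeMillis();
        List<String> pages = fetchAll(urls, 8);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(pages.size() + " pages in " + elapsed + " ms");
    }
}
```

With eight threads the eight waits overlap, so the elapsed time should be close to a single wait rather than eight of them. Passing 1 for threads makes the waits run back to back, which is the single-threaded case the summary argues against.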
