USING A SPIDER - HTTP Programming Recipes for Java Bots


/**
 * Called when the spider tries to process a URL but gets
 * an error. This method is not used in this manager.
 *
 * @param url
 *          The URL that generated an error.
 */
public void spiderURLError(URL url) {
}

What sets the world spider apart is the way it handles new URLs when spiderFoundURL is called. Unlike the previous spiders, it makes no check to determine whether the URL is on the same host. Any URL is a candidate to be visited.

public boolean spiderFoundURL(URL url, URL source,
    SpiderReportable.URLType type) {
  return true;
}

As you can see, spiderFoundURL simply returns true.
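To make the contrast with the earlier single-site spiders concrete, the following minimal sketch places the two URL policies side by side. The class name UrlFilters and both method names are hypothetical and are not part of the Heaton Research Spider; only java.net classes from the standard library are used.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper class for illustration; not part of the spider library.
final class UrlFilters {
    private UrlFilters() {
    }

    /** Single-site policy: accept a URL only if it is on the same host as the base URL. */
    static boolean sameHostOnly(URL base, URL found) {
        return base.getHost().equalsIgnoreCase(found.getHost());
    }

    /** World-spider policy: every URL is a candidate, so no host check is made. */
    static boolean acceptAll(URL found) {
        return true;
    }

    public static void main(String[] args) throws MalformedURLException {
        URL base = URI.create("http://www.example.com/index.html").toURL();
        URL other = URI.create("http://www.example.org/page.html").toURL();

        // The single-site policy rejects a URL on a different host.
        System.out.println("sameHostOnly: " + sameHostOnly(base, other));
        // The world-spider policy accepts it.
        System.out.println("acceptAll: " + acceptAll(other));
    }
}
```

Accepting every URL means the workload can grow without bound, which is one reason a spider of this kind is usually paired with a database-backed workload manager rather than an in-memory one.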

This spider shows how you would set up a spider that accesses a large number of web sites. Of course, this spider is only the beginning of a search engine, but it does demonstrate how to configure the Heaton Research Spider to access a large number of sites.

Recipe #13.4: Display Spider Statistics

Because the SQLWorkloadManager class stores the workload in a database, other programs can monitor the progress of the spider. This recipe shows you how to create a simple program that monitors the spider's progress using the Heaton Research Spider database.
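The idea of a second program reading the workload database can be sketched with plain JDBC. This is only an illustration under stated assumptions, not the recipe's actual code: the table name spider_workload and its status column are assumed here and should be checked against the schema your SQLWorkloadManager uses, and the class name WorkloadStats and its methods are hypothetical.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical monitoring helper; not part of the Heaton Research Spider.
final class WorkloadStats {
    // ASSUMPTION: the workload table is named spider_workload and has a
    // status column. Adjust both names to match the real schema.
    private static final String QUERY =
        "SELECT status, COUNT(*) FROM spider_workload GROUP BY status";

    private WorkloadStats() {
    }

    /** Counts workload rows per status value using an already open connection. */
    static Map<String, Integer> countByStatus(Connection connection)
            throws SQLException {
        Map<String, Integer> counts = new TreeMap<>();
        try (Statement statement = connection.createStatement();
             ResultSet rows = statement.executeQuery(QUERY)) {
            while (rows.next()) {
                counts.put(rows.getString(1), rows.getInt(2));
            }
        }
        return counts;
    }

    /** Formats the counts as one line, with statuses in alphabetical order. */
    static String summarize(Map<String, Integer> counts) {
        int total = 0;
        for (int count : counts.values()) {
            total += count;
        }
        StringBuilder line = new StringBuilder("total=" + total);
        for (Map.Entry<String, Integer> entry : new TreeMap<>(counts).entrySet()) {
            line.append(", ").append(entry.getKey())
                .append('=').append(entry.getValue());
        }
        return line.toString();
    }
}
```

A monitor built this way would open a connection with the same database settings the spider uses, call countByStatus on a timer, and print or display the result of summarize. Because it only reads from the database, it can run while the spider is working.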

This recipe uses a Heaton Research Spider configuration file, just like the previous recipes. To start this recipe, specify the name of the configuration file as the first argument. The following command shows how you might start the recipe:

SpiderStats spider.conf c:\temp\ http://www.example.com

The above command simply shows the abstract format for calling this recipe with the appropriate parameters. For exact information on how to run this recipe, refer to Appendix B, C, or D, depending on the operating system you are using. Figure 13.1 shows this program monitoring a spider's progress.
