There is no need to record the IOException that was caught, since it was most
likely caused by a timeout on the web server and not by a missing page. Timeouts have
a variety of causes, one of which is an overloaded web server. A missing page
would throw an exception when it is first opened, not during the transfer of information.
Therefore, since a timeout is only a temporary server issue, we do not record that page as a
bad link.
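The rule described above can be sketched as a small decision helper. This is only an illustration: the LinkErrorClassifier class, its Phase enum, and the isBadLink method are hypothetical names and are not part of the Heaton Research Spider. Treating a timeout while opening the page as temporary is also an assumption made for this sketch.

```java
import java.io.IOException;
import java.net.SocketTimeoutException;

/** Hypothetical helper; not part of the Heaton Research Spider API. */
final class LinkErrorClassifier {

    /** The point in the download at which the exception was raised. */
    enum Phase { OPEN, TRANSFER }

    /**
     * Returns true if the failure should be recorded as a bad link.
     * Errors raised while the page body is being transferred are treated
     * as temporary server problems and are never recorded.
     */
    static boolean isBadLink(Phase phase, IOException e) {
        if (phase == Phase.TRANSFER) {
            return false; // most likely a timeout, not a missing page
        }
        // Assumption for this sketch: a timeout while opening the page
        // is also temporary, so only other open failures count as bad.
        return !(e instanceof SocketTimeoutException);
    }
}
```

With this helper, a FileNotFoundException raised while opening a page is reported as a bad link, while any IOException raised during the transfer is ignored.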
The Heaton Research Spider calls the spiderURLError method whenever a bad
URL is found. This URL is displayed, along with the page it was found on, and added to the
bad list.
URL source;
try {
    // Look up the page on which the bad URL was found.
    source = this.workloadManager.getSource(url);
    StringBuilder str = new StringBuilder();
    str.append("Bad URL:");
    str.append(url.toString());
    str.append(" found at ");
    str.append(source.toString());
    // Add the formatted message to the list of bad links.
    this.bad.add(str.toString());
} catch (WorkloadException e) {
    e.printStackTrace();
}
These bad URLs are accumulated in the LinkReport class until the spider finishes.
Then the bad URL list is displayed.
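The accumulate-then-display pattern can be tried without the spider itself. The following is a minimal, self-contained sketch: BadLinkReport and its methods are hypothetical names, and a plain map stands in for the workload manager's lookup of the page on which a URL was found. The sketch uses java.net.URI as the map key, since URI comparison never touches the network.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-in for a link report; all names are illustrative. */
final class BadLinkReport {
    // Maps each discovered URL to the page it was found on
    // (a stand-in for the spider's workload manager).
    private final Map<URI, URI> sources = new HashMap<>();
    // Messages collected while the spider runs.
    private final List<String> bad = new ArrayList<>();

    /** Remembers the page on which a URL was discovered. */
    void recordDiscovery(URI url, URI foundOn) {
        sources.put(url, foundOn);
    }

    /** Called for each bad URL; stores a message naming its source page. */
    void urlError(URI url) {
        URI source = sources.get(url);
        String where = (source == null) ? "(unknown source)" : source.toString();
        bad.add("Bad URL: " + url + " found at " + where);
    }

    /** Returns a copy of the accumulated messages for display at the end. */
    List<String> badLinks() {
        return new ArrayList<>(bad);
    }
}
```

Once the crawl is finished, the caller can loop over badLinks() and print each entry.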
Recipe #13.2: Downloading HTML and Images
Another common use for spiders is to create an offline copy of a web site. This recipe
shows how to do this. To start this spider, you must provide three arguments. The first
argument is the name of the spider configuration file. Through the spider configuration file, you
can specify either an SQL-based or a memory-based workload manager. Listing 13.1 shows an
example spider configuration file. The second argument is the local directory to which the site
will be downloaded. The third argument is the starting URL.
The following shows how you might start the spider:
DownloadSite spider.conf c:\temp\ http://www.example.com
The above command simply shows the abstract format for calling this recipe with the
appropriate parameters. For exact information on how to run this recipe, refer to Appendix B, C, or
D, depending on the operating system you are using. Now that you have seen how to use the
download spider, we will examine how it was constructed.
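Before looking at the recipe itself, it may help to see how the three command-line arguments could be validated. The following is a minimal sketch and not the recipe's actual code: the DownloadArgs class, its field names, and the usage message are all assumptions made for illustration.

```java
import java.io.File;
import java.net.URI;
import java.net.URISyntaxException;

/** Hypothetical holder for the three arguments; not the recipe's own code. */
final class DownloadArgs {
    final File configFile; // spider configuration file
    final File targetDir;  // local directory that receives the site
    final URI startUrl;    // URL at which the spider starts

    private DownloadArgs(File configFile, File targetDir, URI startUrl) {
        this.configFile = configFile;
        this.targetDir = targetDir;
        this.startUrl = startUrl;
    }

    /** Parses and validates the arguments, in the order given on the command line. */
    static DownloadArgs parse(String[] args) {
        if (args.length != 3) {
            throw new IllegalArgumentException(
                "Usage: DownloadSite [config file] [local directory] [starting URL]");
        }
        URI start;
        try {
            start = new URI(args[2]);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Invalid starting URL: " + args[2], e);
        }
        // A relative reference such as "index.html" cannot start a crawl.
        if (!start.isAbsolute()) {
            throw new IllegalArgumentException("Starting URL must be absolute: " + args[2]);
        }
        return new DownloadArgs(new File(args[0]), new File(args[1]), start);
    }
}
```

Rejecting a wrong argument count or a relative URL up front gives the user a clear message before any downloading begins.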