USING A SPIDER - HTTP Programming Recipes for Java Bots

try {
    source = this.workloadManager.getSource(url);
    StringBuilder str = new StringBuilder();
    str.append("Bad URL:");
    str.append(url.toString());
    str.append(" found at ");
    str.append(source.toString());
    this.bad.add(str.toString());
} catch (WorkloadException e) {
    e.printStackTrace();
}
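The fragment above depends on the spider's workload manager, so it cannot run on its own. The following is a minimal, self-contained sketch of the same idea: look up the page on which a bad link was found and add a line to the list of bad links. The class BadLinkReport, its methods, and the in-memory map that stands in for the workload manager are hypothetical names chosen for illustration; they are not part of the book's spider. As a further assumption, a link whose source is unknown is still reported rather than dropped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Hypothetical stand-in for the bad-link bookkeeping shown above.
 * A plain map replaces the workload manager's source lookup.
 */
final class BadLinkReport {
    // Maps each discovered URL to the page that linked to it (assumption:
    // the real workload manager stores an equivalent relationship).
    private final Map<String, String> sources = new HashMap<>();

    // Human-readable descriptions of every bad link seen so far.
    private final List<String> bad = new ArrayList<>();

    /** Remembers which page a link was found on. */
    public void recordLink(String url, String source) {
        sources.put(url, source);
    }

    /** Records a failed URL together with the page that referenced it. */
    public void urlError(String url) {
        String source = sources.get(url);
        if (source == null) {
            source = "(unknown source)"; // assumption: report it anyway
        }
        StringBuilder str = new StringBuilder();
        str.append("Bad URL:");
        str.append(url);
        str.append(" found at ");
        str.append(source);
        bad.add(str.toString());
    }

    /** Returns the bad-link descriptions collected so far. */
    public List<String> getBad() {
        return bad;
    }

    public static void main(String[] args) {
        BadLinkReport report = new BadLinkReport();
        report.recordLink("http://www.example.com/missing.html",
                "http://www.example.com/index.html");
        report.urlError("http://www.example.com/missing.html");
        for (String line : report.getBad()) {
            System.out.println(line);
        }
    }
}
```

Keeping the source page alongside each bad URL is what makes the report useful: it tells you which page needs to be edited to fix the broken link.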

The ReportLinks class implements all of the methods defined by the
SpiderReportable interface. To review what these methods are for,
refer to Table 13.1.

The foundURL method is called each time a new URL is found. Because this spider
operates on only a single web server, the foundURL method ensures that all new URLs
are on the same server.

if ((this.base != null) &&
        (!this.base.equalsIgnoreCase(url.getHost()))) {
    return false;
}
return true;

If the new URL's host differs from the starting host, the foundURL method returns
false, causing the spider to ignore the new URL. The above lines of code can be reused in
any spider that is to operate on only a single host.
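To reuse the host check outside the spider, it can be wrapped in a small helper. The following is a minimal sketch, not the book's code: HostFilter and sameHost are hypothetical names, and the example hosts are placeholders. It relies only on java.net.URL.getHost(), which returns the host name without the port.

```java
import java.net.URI;
import java.net.URL;

/**
 * Hypothetical helper that applies the single-host rule described above.
 */
final class HostFilter {
    // Host the spider started on; null means every host is accepted.
    private final String base;

    public HostFilter(String base) {
        this.base = base;
    }

    /** Returns true only if the URL is on the base host (case-insensitive). */
    public boolean sameHost(URL url) {
        if ((this.base != null)
                && (!this.base.equalsIgnoreCase(url.getHost()))) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) throws Exception {
        HostFilter filter = new HostFilter("www.example.com");

        URL same = URI.create("http://WWW.EXAMPLE.COM/a.html").toURL();
        URL other = URI.create("http://other.example.org/").toURL();

        System.out.println(filter.sameHost(same));  // prints "true"
        System.out.println(filter.sameHost(other)); // prints "false"
    }
}
```

Because the comparison uses only the host name, a URL on the same host but a different port or protocol still passes the check. A spider that must stay on one port as well would need to compare url.getPort() too.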

The processURL method, which usually downloads a URL, is fairly simple. Because
we are only checking links, we do not need to store or process the page's contents.
The page only has to be read through to the end so that its links are found. This
can be done by calling the readAll method of the ParseHTML object.

try {
    parse.readAll();
} catch (IOException e) {
    logger.log(Level.INFO, "Error reading page:" + url.toString());
}
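The idea of reading a page to the end without keeping it can be illustrated with a plain InputStream. This is only a sketch under the assumption that readAll simply consumes the page until the end of the stream; it does not reproduce the ParseHTML class, and DrainStream is a hypothetical name. The in-memory stream stands in for a downloaded page.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

/**
 * Hypothetical illustration of consuming a stream without storing it.
 */
final class DrainStream {
    /**
     * Reads the stream until end-of-stream, discarding the data.
     * Returns the number of bytes that were read.
     */
    public static long readAll(InputStream in) throws IOException {
        byte[] buffer = new byte[4096];
        long total = 0;
        int n;
        while ((n = in.read(buffer)) != -1) {
            total += n; // count the bytes, but keep nothing
        }
        return total;
    }

    public static void main(String[] args) {
        byte[] page = "<a href=\"/next.html\">next</a>"
                .getBytes(StandardCharsets.UTF_8);
        try (InputStream in = new ByteArrayInputStream(page)) {
            System.out.println("Bytes consumed: " + readAll(in));
        } catch (IOException e) {
            // Mirrors the handling above: report the failure and move on.
            System.out.println("Error reading page: " + e.getMessage());
        }
    }
}
```

As in the fragment above, an IOException is reported and not rethrown, so one unreadable page does not stop the rest of the crawl.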
