INSIDE THE HEATON RESEARCH SPIDER - HTTP Programming Recipes for Java Bots

      logger.log(Level.SEVERE,
          "Caught exception at URL:" + this.url.toString(), e);
      this.spider.getReport().spiderURLError(this.url);
      return;
    } finally {
      if (is != null) {
        try {
          is.close();
        } catch (IOException e) {
          // ignore errors while closing the stream
        }
      }
    }

    try {
      // mark URL as complete
      this.spider.getWorkloadManager().markProcessed(this.url);
      logger.fine("Complete: " + this.url);
      if (!this.url.equals(connection.getURL())) {
        // save the URL (for redirects)
        this.spider.getWorkloadManager().add(
            connection.getURL(), this.url,
            this.spider.getWorkloadManager().getDepth(
                connection.getURL()));
        this.spider.getWorkloadManager().markProcessed(
            connection.getURL());
      }
    } catch (WorkloadException e) {
      logger.log(Level.WARNING, "Error marking workload(3).", e);
    }

As the thread pool processes the SpiderWorker objects presented to it, the run method of each SpiderWorker object is executed. The run method begins by logging the URL that it is currently processing. It then opens a connection to that URL.

    try {
      logger.fine("Processing: " + this.url);
      // Get the URL's contents.
      connection = this.url.openConnection();
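
To illustrate how a thread pool ends up calling each worker's run method, here is a minimal sketch built on the standard java.util.concurrent executor. The names WorkerPoolSketch, DemoWorker, and processAll are hypothetical stand-ins, not classes from the Heaton Research Spider, and the worker only logs and records its URL rather than opening a connection.

```java
import java.net.URI;
import java.net.URL;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

public class WorkerPoolSketch {
  private static final Logger logger =
      Logger.getLogger(WorkerPoolSketch.class.getName());

  /** Hypothetical stand-in for SpiderWorker: one worker per URL. */
  static class DemoWorker implements Runnable {
    private final URL url;
    private final List<URL> processed;

    DemoWorker(URL url, List<URL> processed) {
      this.url = url;
      this.processed = processed;
    }

    @Override
    public void run() {
      // Called by a pool thread, not by the code that submitted the worker.
      logger.fine("Processing: " + this.url);
      // A real worker would open a connection to the URL here.
      this.processed.add(this.url);
    }
  }

  /** Submits one worker per URL and waits for all of them to finish. */
  static List<URL> processAll(List<URL> urls, int threads)
      throws InterruptedException {
    List<URL> processed = new CopyOnWriteArrayList<>();
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    for (URL url : urls) {
      pool.execute(new DemoWorker(url, processed));
    }
    pool.shutdown(); // accept no new work, let queued workers finish
    pool.awaitTermination(10, TimeUnit.SECONDS);
    return processed;
  }

  public static void main(String[] args) throws Exception {
    List<URL> urls = List.of(
        URI.create("http://www.example.com/a").toURL(),
        URI.create("http://www.example.com/b").toURL());
    System.out.println(processAll(urls, 2).size() + " URLs processed");
  }
}
```

The order in which the workers finish is not guaranteed, which is why the sketch collects results in a thread-safe list rather than relying on submission order.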

Next, the timeout values are set. The same timeout value is used for both the connection timeout and the read timeout.

      connection.setConnectTimeout(this.spider.getOptions().timeout);
      connection.setReadTimeout(this.spider.getOptions().timeout);
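
The same step can be tried in isolation. The sketch below is a standalone illustration, not part of the spider: the class TimeoutSketch, the method openWithTimeout, and the 60000 value are assumptions made for the example. Both setters take a value in milliseconds, and a value of 0 means wait indefinitely.

```java
import java.io.IOException;
import java.net.URI;
import java.net.URL;
import java.net.URLConnection;

public class TimeoutSketch {

  /**
   * Creates a connection object for the URL and applies one timeout
   * (in milliseconds) to both connecting and reading. No network
   * traffic occurs until the caller connects or requests the stream.
   */
  static URLConnection openWithTimeout(URL url, int timeoutMillis)
      throws IOException {
    URLConnection connection = url.openConnection();
    connection.setConnectTimeout(timeoutMillis); // limit on establishing the link
    connection.setReadTimeout(timeoutMillis);    // limit on waiting for data
    return connection;
  }

  public static void main(String[] args) throws IOException {
    URL url = URI.create("http://www.example.com/").toURL();
    // 60000 ms is an arbitrary value chosen for this example.
    URLConnection connection = openWithTimeout(url, 60000);
    System.out.println("connect timeout: " + connection.getConnectTimeout());
    System.out.println("read timeout: " + connection.getReadTimeout());
  }
}
```

If either limit is exceeded once the connection is actually used, a java.net.SocketTimeoutException is raised. Because that class extends IOException, a handler for IOException such as the one in the run method will catch it.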
