INSIDE THE HEATON RESEARCH SPIDER - HTTP Programming Recipes for Java Bots

If a User-Agent was specified in the spider's options, it is sent as the User-Agent request header.

if (this.spider.getOptions().userAgent != null) {
  connection.setRequestProperty("User-Agent",
      this.spider.getOptions().userAgent);
}
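
As a self-contained illustration of the same idea, the following sketch applies an optional User-Agent to a standard java.net.URLConnection. The class name UserAgentSketch, the open helper, and the MyBot/1.0 string are assumptions made for this example; they are not part of the Heaton Research Spider.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

class UserAgentSketch {
    // Hypothetical helper: creates a connection object for the address and
    // sets the User-Agent header only when a value was supplied.
    static URLConnection open(String address, String userAgent) {
        try {
            // openConnection() only creates the object; no request is sent yet.
            URLConnection connection = URI.create(address).toURL().openConnection();
            if (userAgent != null) {
                connection.setRequestProperty("User-Agent", userAgent);
            }
            return connection;
        } catch (IOException e) {
            // Covers MalformedURLException as well, which is an IOException.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection connection = open("http://www.example.com/", "MyBot/1.0");
        // Reads back the header that will be sent once the connection is used.
        System.out.println(connection.getRequestProperty("User-Agent"));
    }
}
```

Because openConnection only creates the connection object, the header can be set and read back before any network traffic takes place.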

The spider is now ready to read the contents of the URL. First, the spider checks whether the data from the URL is of the MIME type text/html. If it is, a new SpiderParseHTML object is created and passed to the spiderProcessURL method of the SpiderReportable object. Otherwise, the raw InputStream is passed to spiderProcessURL instead.

The SpiderParseHTML object works exactly the same as the ParseHTML class, except that it allows the spider to gather links as the spiderProcessURL method parses the HTML. This makes the spider's link gathering transparent to the class using the spider.

// Read the URL.
is = connection.getInputStream();

// Parse the URL.
if (connection.getContentType().equalsIgnoreCase("text/html")) {
  SpiderParseHTML parse = new SpiderParseHTML(connection.getURL(),
      new SpiderInputStream(is, null), this.spider);
  this.spider.getReport().spiderProcessURL(this.url, parse);
} else {
  this.spider.getReport().spiderProcessURL(this.url, is);
}
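
The comparison in the listing matches only a Content-Type header that is exactly text/html. Servers often append a parameter, as in text/html; charset=UTF-8, and URLConnection.getContentType() returns null when the type is not known. The sketch below shows one possible, more tolerant check. The class ContentTypeSketch and the method isHtml are hypothetical names introduced for this example and do not appear in the spider's source.

```java
class ContentTypeSketch {
    // Hypothetical helper: returns true when the Content-Type value names the
    // text/html MIME type, ignoring case and any parameters such as charset.
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // the server sent no usable Content-Type
        }
        int semicolon = contentType.indexOf(';');
        String mimeType = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        return mimeType.trim().equalsIgnoreCase("text/html");
    }

    public static void main(String[] args) {
        System.out.println(isHtml("text/html; charset=UTF-8"));
        System.out.println(isHtml("image/png"));
        System.out.println(isHtml(null));
    }
}
```

A check of this kind could stand in for the equalsIgnoreCase call, so that HTML pages served with a charset parameter are still handed to the HTML parser.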

If an I/O exception occurs while reading the page, then the exception is logged.

} catch (IOException e) {
  logger.log(Level.INFO, "I/O error on URL:"
      + this.url.toString());
  try {

In addition to logging the exception, the spider marks the page with “error” in the workload manager.

    this.spider.getWorkloadManager().markError(this.url);
  } catch (WorkloadException e1) {
    // Log the WorkloadException itself rather than the earlier IOException.
    logger.log(Level.WARNING, "Error marking workload(1).", e1);
  }
  this.spider.getReport().spiderURLError(this.url);
  return;

The spider also traps any Throwable that occurs. This prevents errors in the SpiderReportable class from causing the spider to crash. If an exception occurs, it is logged, and the spider continues.
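
The pattern described above can be sketched in isolation as follows. This is a minimal illustration and not the spider's actual code: the PageHandler interface is a hypothetical stand-in for SpiderReportable, and processAll is an assumed name for a loop that keeps going after a callback fails.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class ThrowableTrapSketch {
    private static final Logger logger =
            Logger.getLogger(ThrowableTrapSketch.class.getName());

    // Hypothetical stand-in for the user-supplied SpiderReportable callback.
    interface PageHandler {
        void process(String url) throws Exception;
    }

    // Calls the handler for every URL. Anything the handler throws is logged,
    // and the loop moves on to the next URL. Returns the URLs that failed.
    static List<String> processAll(List<String> urls, PageHandler handler) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            try {
                handler.process(url);
            } catch (Throwable t) {
                logger.log(Level.SEVERE, "Caught exception at URL: " + url, t);
                failed.add(url);
            }
        }
        return failed;
    }
}
```

Catching Throwable is deliberately broad. It keeps a faulty callback from stopping the crawl, at the cost of also trapping serious JVM errors that a narrower catch would let propagate.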
