Java Reference
In-Depth Information
If a User-Agent string was specified in the spider's options, it is sent as the User-Agent request header.
if (this.spider.getOptions().userAgent != null) {
    connection.setRequestProperty("User-Agent",
            this.spider.getOptions().userAgent);
}
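As a standalone illustration of the same step, the sketch below applies an optional
User-Agent to a plain java.net.URLConnection. The class name UserAgentDemo, the helper
open, and the agent string "ExampleSpider/1.0" are assumptions made for this example and
are not part of the spider. Request properties have to be set before the connection is
actually opened, which is why the helper only creates the connection and does not connect.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

// Hypothetical helper class, used only for this illustration.
class UserAgentDemo {
    /** Creates (but does not connect) a connection and applies the optional User-Agent. */
    static URLConnection open(String url, String userAgent) {
        try {
            URLConnection connection = URI.create(url).toURL().openConnection();
            if (userAgent != null) {
                // Must happen before connect() or getInputStream() is called.
                connection.setRequestProperty("User-Agent", userAgent);
            }
            return connection;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        URLConnection c = open("http://example.com/", "ExampleSpider/1.0");
        System.out.println(c.getRequestProperty("User-Agent"));
    }
}
```

When no User-Agent is supplied, the helper leaves the header alone and the Java runtime
falls back to its own default.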
The spider is now ready to read the contents of the URL. First, it checks whether the
data returned from the URL has the MIME type text/html. If so, a new SpiderParseHTML
object is created and passed to the spiderProcessURL method of the SpiderReportable
object. Otherwise, the raw input stream is passed to spiderProcessURL instead.
The SpiderParseHTML class works exactly like the ParseHTML class, except that it lets
the spider gather links while the spiderProcessURL method parses the HTML. This makes
the spider's link gathering transparent to the class using the spider.
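The idea of gathering links transparently can be sketched with a subclass that keeps the
parent's interface and records links as a side effect. The classes TagReader and
LinkGatheringTagReader below are hypothetical stand-ins invented for this example; they
are not the book's ParseHTML or SpiderParseHTML, and the regular expressions are far
simpler than a real HTML parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical plain reader: hands out one tag at a time. */
class TagReader {
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private final Matcher matcher;

    TagReader(String html) {
        this.matcher = TAG.matcher(html);
    }

    /** Returns the next tag, or null when the input is exhausted. */
    String nextTag() {
        return matcher.find() ? matcher.group() : null;
    }
}

/** Same interface as TagReader, but records every double-quoted href it passes along. */
class LinkGatheringTagReader extends TagReader {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    private final List<String> links = new ArrayList<>();

    LinkGatheringTagReader(String html) {
        super(html);
    }

    @Override
    String nextTag() {
        String tag = super.nextTag();
        if (tag != null) {
            Matcher m = HREF.matcher(tag);
            if (m.find()) {
                links.add(m.group(1)); // the caller never sees this bookkeeping
            }
        }
        return tag;
    }

    List<String> getLinks() {
        return links;
    }
}
```

Code that consumes a TagReader can be handed a LinkGatheringTagReader without any change,
and the collected links are available afterwards through getLinks.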
// Read the URL.
is = connection.getInputStream();
// Parse the URL.
if (connection.getContentType().equalsIgnoreCase("text/html")) {
    SpiderParseHTML parse = new SpiderParseHTML(connection.getURL(),
            new SpiderInputStream(is, null), this.spider);
    this.spider.getReport().spiderProcessURL(this.url, parse);
} else {
    this.spider.getReport().spiderProcessURL(this.url, is);
}
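The comparison in the listing is an exact match against "text/html". In practice,
URLConnection.getContentType() returns the raw header value, which may be null when the
server sends no Content-Type and may carry parameters such as "; charset=UTF-8". The
sketch below shows one possible, more tolerant check; the class ContentTypes and the
method isHtml are hypothetical names for this example and do not come from the spider.

```java
import java.util.Locale;

// Hypothetical utility class, used only for this illustration.
class ContentTypes {
    /** Returns true when a Content-Type header value names the text/html media type. */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // no Content-Type header was available
        }
        // Drop any parameters such as "; charset=UTF-8" before comparing.
        int semicolon = contentType.indexOf(';');
        String mediaType = (semicolon >= 0)
                ? contentType.substring(0, semicolon)
                : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }
}
```

With such a helper, the condition could read ContentTypes.isHtml(connection.getContentType()),
so that a page served as "text/html; charset=UTF-8" is still handed to the HTML parser.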
If an I/O exception occurs while reading the page, then the exception is logged.
} catch (IOException e) {
    logger.log(Level.INFO, "I/O error on URL:"
            + this.url.toString());
    try {
In addition to logging the exception, the spider marks the URL as an error in the
workload manager and reports the failure through spiderURLError.
        this.spider.getWorkloadManager().markError(this.url);
    } catch (WorkloadException e1) {
        logger.log(Level.WARNING, "Error marking workload(1).", e1);
    }
    this.spider.getReport().spiderURLError(this.url);
    return;
The spider also traps any Throwable that occurs. This prevents errors in the
SpiderReportable class from crashing the spider. If such an exception occurs, it is
logged, and the spider continues.
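The effect of trapping Throwable around a callback can be sketched as follows. The class
GuardedWorker, the method processAll, and the use of a Consumer as the callback are
assumptions made for this example; the spider itself calls a SpiderReportable object and
records failures through its workload manager.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical worker, used only for this illustration.
class GuardedWorker {
    private static final Logger logger =
            Logger.getLogger(GuardedWorker.class.getName());

    /** Runs the callback for each URL and returns the URLs whose callback failed. */
    static List<String> processAll(List<String> urls, Consumer<String> callback) {
        List<String> failed = new ArrayList<>();
        for (String url : urls) {
            try {
                callback.accept(url);
            } catch (Throwable t) {
                // A faulty callback must not stop the remaining URLs.
                logger.log(Level.SEVERE, "Caught exception at URL: " + url, t);
                failed.add(url);
            }
        }
        return failed;
    }
}
```

If the callback throws for one URL, that URL is logged and recorded as failed while the
loop moves on to the next one. Catching Throwable is deliberately broad: it also catches
Error subclasses, which is a trade-off accepted here to keep user-supplied code from
halting the crawl.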