If a User-Agent was specified, then it is set on the connection.
if (this.spider.getOptions().userAgent != null) {
    connection.setRequestProperty("User-Agent",
        this.spider.getOptions().userAgent);
}
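The same pattern can be tried in isolation. The following is a minimal sketch, not the spider's own code: the class name UserAgentSketch, the open helper, and the ExampleBot/1.0 string are assumptions made for illustration. It relies only on the standard URLConnection API, and no network traffic occurs until the caller actually connects or reads from the connection.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URI;
import java.net.URLConnection;

class UserAgentSketch {
    /**
     * Creates a connection object for the address and applies the
     * optional User-Agent. Hypothetical helper; a null userAgent
     * leaves the JDK default in place.
     */
    static URLConnection open(String address, String userAgent) {
        try {
            // openConnection() only builds the object; it does not connect.
            URLConnection connection = URI.create(address).toURL().openConnection();
            if (userAgent != null) {
                connection.setRequestProperty("User-Agent", userAgent);
            }
            return connection;
        } catch (IOException e) {
            // A malformed URL is also reported as an IOException subclass.
            throw new UncheckedIOException(e);
        }
    }
}
```

Request properties must be set before the connection is opened; calling setRequestProperty on an already connected URLConnection throws an IllegalStateException.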
The spider is now ready to read the contents of the URL. First, the spider checks whether the data from the URL is of the MIME type text/html. If this is a text/html document, then a new SpiderParseHTML object is created and the spiderProcessURL method is called for the SpiderReportable object.
The SpiderParseHTML object works exactly the same as the ParseHTML class, except that it allows the spider to gather links as the spiderProcessURL method parses the HTML. This makes the spider's link gathering transparent to the class using the spider.
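The idea of gathering links as a side effect of parsing can be sketched with a small subclass. This is only an illustration under stated assumptions: TagParser and LinkGatheringParser are hypothetical stand-ins invented here, not the real ParseHTML and SpiderParseHTML classes, and the regular expression handles only double-quoted href attributes.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkGatheringSketch {
    /** Hypothetical stand-in for a parser: hands out one tag at a time. */
    static class TagParser {
        private final Iterator<String> tags;

        TagParser(List<String> tags) {
            this.tags = tags.iterator();
        }

        /** Returns the next tag, or null at the end of the input. */
        String nextTag() {
            return tags.hasNext() ? tags.next() : null;
        }
    }

    /** Same interface as TagParser, but records href values as tags pass by. */
    static class LinkGatheringParser extends TagParser {
        private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);

        final List<String> found = new ArrayList<>();

        LinkGatheringParser(List<String> tags) {
            super(tags);
        }

        @Override
        String nextTag() {
            String tag = super.nextTag();
            if (tag != null) {
                Matcher m = HREF.matcher(tag);
                if (m.find()) {
                    found.add(m.group(1));
                }
            }
            return tag;
        }
    }

    /** A caller that only knows about TagParser and never sees the links. */
    static int countTags(TagParser parser) {
        int n = 0;
        while (parser.nextTag() != null) {
            n++;
        }
        return n;
    }
}
```

Because the caller is written against the base type, it needs no changes when handed the link-gathering subclass, which is the sense in which the gathering is transparent.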
// Read the URL.
is = connection.getInputStream();
// Parse the URL.
if (connection.getContentType().equalsIgnoreCase("text/html")) {
    SpiderParseHTML parse = new SpiderParseHTML(connection.getURL(),
        new SpiderInputStream(is, null), this.spider);
    this.spider.getReport().spiderProcessURL(this.url, parse);
} else {
    this.spider.getReport().spiderProcessURL(this.url, is);
}
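One caveat about the content-type test: getContentType() returns null when the server sends no Content-Type header, and many servers append parameters such as "; charset=UTF-8", in which case an exact comparison with "text/html" fails. The following is a minimal sketch of a more tolerant check. The ContentTypeSketch class and its isHtml method are hypothetical helpers, not part of the spider.

```java
import java.util.Locale;

class ContentTypeSketch {
    /**
     * Returns true when the media type portion of a Content-Type
     * header value is text/html, ignoring case and any parameters.
     */
    static boolean isHtml(String contentType) {
        if (contentType == null) {
            return false; // header missing or type unknown
        }
        // Drop parameters such as "; charset=UTF-8".
        int semicolon = contentType.indexOf(';');
        String mediaType = semicolon >= 0
            ? contentType.substring(0, semicolon)
            : contentType;
        return mediaType.trim().toLowerCase(Locale.ROOT).equals("text/html");
    }
}
```

With such a helper, the condition could read isHtml(connection.getContentType()), so that a page served as "text/html; charset=UTF-8" is still parsed for links.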
If an I/O exception occurs while reading the page, then the exception is logged.
} catch (IOException e) {
    logger.log(Level.INFO, "I/O error on URL:"
        + this.url.toString());
    try {
In addition to logging the exception, the page is also marked with “error” in the workload manager.
        this.spider.getWorkloadManager().markError(this.url);
    } catch (WorkloadException e1) {
        logger.log(Level.WARNING, "Error marking workload(1).", e1);
    }
    this.spider.getReport().spiderURLError(this.url);
    return;
The spider also traps any Throwable that occurs. This prevents errors that occur in the SpiderReportable class from causing the spider to crash. If an exception occurs, it is logged, and the spider continues.
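The effect of trapping Throwable around a callback can be shown with a short, self-contained sketch. The names used here (ThrowableGuardSketch, PageHandler, processAll) are assumptions for illustration and do not appear in the spider; PageHandler simply plays the role of the user-supplied reporting class.

```java
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

class ThrowableGuardSketch {
    private static final Logger logger =
        Logger.getLogger(ThrowableGuardSketch.class.getName());

    /** Hypothetical callback standing in for user-supplied page handling. */
    interface PageHandler {
        void process(String url);
    }

    /**
     * Calls the handler for every URL. A failure in one call is logged
     * and counted, and processing moves on to the next URL.
     */
    static int processAll(List<String> urls, PageHandler handler) {
        int failures = 0;
        for (String url : urls) {
            try {
                handler.process(url);
            } catch (Throwable t) {
                logger.log(Level.SEVERE, "Caught exception at URL: " + url, t);
                failures++;
            }
        }
        return failures;
    }
}
```

Catching Throwable rather than Exception also intercepts Error subclasses such as OutOfMemoryError. That is a deliberate trade-off in a long-running crawler: a faulty callback cannot stop the crawl, at the cost of continuing after conditions that may not be recoverable.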