        try {
            source = this.workloadManager.getSource(url);
            StringBuilder str = new StringBuilder();
            str.append("Bad URL:");
            str.append(url.toString());
            str.append(" found at ");
            str.append(source.toString());
            this.bad.add(str.toString());
        } catch (WorkloadException e) {
            e.printStackTrace();
        }
    }
}
The ReportLinks class implements all of the methods defined by the
SpiderReportable interface. To review what each of these methods is for,
refer to Table 13.1.
The foundURL method is called each time a new URL is found. Because this spider
only operates on a single web server, the foundURL method ensures that all new URLs
are on the same server.
    if ((this.base != null) &&
        (!this.base.equalsIgnoreCase(url.getHost()))) {
        return false;
    }
    return true;
If the new URL's host differs from the starting host, the foundURL method returns
false, causing the spider to ignore the new URL. The above lines of code can be reused in
any spider that is to operate only on a single host.
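To see the host check in isolation, the following is a minimal, self-contained sketch rather than the spider's actual code. The SameHostFilter class and its accept method are hypothetical names invented for this illustration, and java.net.URI is used in place of the spider's URL object only so that the example needs no checked exception handling.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of the spider's API.
public class SameHostFilter {
    // Host the spider started on; null means "accept every host".
    private final String base;

    public SameHostFilter(String base) {
        this.base = base;
    }

    // Mirrors the check above: reject any URL whose host differs
    // (ignoring case) from the starting host.
    public boolean accept(URI url) {
        if ((this.base != null)
                && (!this.base.equalsIgnoreCase(url.getHost()))) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) {
        SameHostFilter filter = new SameHostFilter("www.example.com");
        System.out.println(filter.accept(URI.create("http://WWW.EXAMPLE.COM/about")));
        System.out.println(filter.accept(URI.create("http://other.example.org/")));
    }
}
```

Because equalsIgnoreCase returns false for a null argument, a URL with no host component (a mailto: link, for example) is also rejected whenever a base host is set.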
The processURL method, which usually downloads a URL, is fairly simple. Because
we are only checking links, we do not need to save or otherwise process the page's
contents. We only need to read through the page, which is done by calling the
readAll method of the ParseHTML object.
    try {
        parse.readAll();
    } catch (IOException e) {
        logger.log(Level.INFO, "Error reading page:" + url.toString());
    }
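The idea of reading a page to its end without keeping its contents can be sketched with a plain InputStream. This is only an analogy for readAll, not its implementation: the StreamDrain class and drainQuietly method are hypothetical, and the -1 return value on failure is an assumption made for this example. The error handling follows the same pattern as above, logging the failure and moving on.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical stand-in for illustration; not the ParseHTML class.
public class StreamDrain {
    private static final Logger logger =
            Logger.getLogger(StreamDrain.class.getName());

    // Reads the stream to its end and discards the bytes.
    // Returns the number of bytes read, or -1 if reading failed.
    public static long drainQuietly(InputStream in, String label) {
        byte[] buffer = new byte[4096];
        long total = 0;
        try {
            int n;
            while ((n = in.read(buffer)) != -1) {
                total += n;
            }
            return total;
        } catch (IOException e) {
            // Same policy as the spider code: log and carry on.
            logger.log(Level.INFO, "Error reading page: " + label);
            return -1;
        }
    }

    public static void main(String[] args) {
        byte[] html = "<a href=\"/x\">x</a>".getBytes(StandardCharsets.UTF_8);
        long count = drainQuietly(new ByteArrayInputStream(html),
                "http://www.example.com/");
        System.out.println("Bytes read: " + count);
    }
}
```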