        try {
            source = this.workloadManager.getSource(url);
            StringBuilder str = new StringBuilder();
            str.append("Bad URL:");
            str.append(url.toString());
            str.append(" found at ");
            str.append(source.toString());
            this.bad.add(str.toString());
        } catch (WorkloadException e) {
            e.printStackTrace();
        }
    }
}
The ReportLinks class implements all of the methods defined by the SpiderReportable interface. To review what these methods are for, refer to Table 13.1.
The foundURL method is called each time the spider finds a new URL. Because this spider operates on only a single web server, foundURL checks that each new URL is on the same server as the starting URL.
if ((this.base != null) &&
        (!this.base.equalsIgnoreCase(url.getHost()))) {
    return false;
}
return true;
If the new URL's host differs from the starting host, foundURL returns false, causing the spider to ignore the new URL. The lines of code above can be reused in any spider that is to operate on only a single host.
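As a way to try the host check in isolation, the following is a minimal sketch that wraps the same comparison in a small standalone class. The class name SameHostFilter, the accept method, and the example host names are assumptions made for illustration; they are not part of the spider library described in this chapter.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

// Hypothetical helper for illustration; not part of the book's spider library.
class SameHostFilter {
    // Host the spider started on; null disables the check.
    private final String base;

    SameHostFilter(String base) {
        this.base = base;
    }

    // Same comparison as the listing above: reject a URL only when a base
    // host is set and the URL's host differs from it, ignoring case.
    boolean accept(URL url) {
        if ((this.base != null)
                && (!this.base.equalsIgnoreCase(url.getHost()))) {
            return false;
        }
        return true;
    }

    public static void main(String[] args) throws MalformedURLException {
        SameHostFilter filter = new SameHostFilter("www.example.com");
        URL same = URI.create("http://WWW.Example.com/page.html").toURL();
        URL other = URI.create("http://other.example.org/").toURL();
        System.out.println(filter.accept(same));   // prints "true"
        System.out.println(filter.accept(other));  // prints "false"
    }
}
```

Note that the comparison is on the full host name, so under this check www.example.com and example.com count as different hosts.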
The processURL method, which usually downloads a URL, is fairly simple. Because we are only checking links, we do not need to save or process the contents of the page; it is enough to read through the page to its end. This is done by calling the readAll method of the ParseHTML object.
try {
    parse.readAll();
} catch (IOException e) {
    logger.log(Level.INFO, "Error reading page:" + url.toString());
}
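To illustrate the general idea of reading a page through to its end without keeping its contents, here is a minimal sketch built on the standard java.io.Reader class. It is not the ParseHTML implementation: the class name DrainExample, the drain method, and the sample HTML string are assumptions made for illustration only.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical example; it only demonstrates the read-and-discard pattern.
class DrainExample {
    // Reads characters until end of stream and discards them.
    // Returns how many characters were consumed.
    static long drain(Reader in) throws IOException {
        long count = 0;
        while (in.read() != -1) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // A StringReader stands in for the page's input stream.
        try (Reader in = new StringReader("<a href=\"page.html\">link</a>")) {
            System.out.println("Characters read: " + drain(in));
        } catch (IOException e) {
            System.err.println("Error reading page: " + e.getMessage());
        }
    }
}
```

As in the listing above, an IOException is reported rather than rethrown, so one unreadable page does not stop the rest of the run.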