If an asterisk was not specified, then the User-agent command must match the user agent
of the spider. If they match, the parser becomes active and begins checking for Disallow
commands.
if ((this.userAgent != null) && rest.equalsIgnoreCase(this.userAgent)) {
  this.active = true;
}
If we are currently active, then we need to check for additional commands.
if (this.active) {
Next, we check to see if this is a Disallow command. If it is, and a path follows it,
we create a new URL from that path, resolved relative to the robots.txt URL, and add
its file portion to the exclude list.
if (command.equalsIgnoreCase("disallow")) {
  if (rest.trim().length() > 0) {
    URL url = new URL(this.robotURL, rest);
    add(url.getFile());
  }
}
At this point, the line is now parsed, and we are ready for the next line.
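The line-parsing steps above can be pulled together into one small class. The following is only a sketch under stated assumptions, not the book's actual listing: the class name RobotsLineSketch, the parseLine and getExclude methods, the comment stripping, and the way a line is split at its first colon are all hypothetical. Only the userAgent, active, robotURL, and exclude names and the Disallow handling come from the fragments shown. The sketch also assumes that every User-agent line resets the active flag, which the fragments do not confirm.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-line robots.txt handling; not the book's class.
class RobotsLineSketch {
    private final URL robotURL;          // location of robots.txt
    private final String userAgent;      // the spider's user agent, may be null
    private final List<String> exclude = new ArrayList<>();
    private boolean active = false;      // true while a matching record is being read

    RobotsLineSketch(String robotsLocation, String userAgent) {
        try {
            this.robotURL = new URL(robotsLocation);
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad robots.txt URL: " + robotsLocation, e);
        }
        this.userAgent = userAgent;
    }

    // Process one line of robots.txt (assumed format: "command: rest").
    void parseLine(String line) {
        String str = line.trim();
        int hash = str.indexOf('#');     // assumption: '#' starts a comment
        if (hash != -1) {
            str = str.substring(0, hash).trim();
        }
        int colon = str.indexOf(':');
        if (colon == -1) {
            return;                      // not a command line
        }
        String command = str.substring(0, colon).trim();
        String rest = str.substring(colon + 1).trim();

        if (command.equalsIgnoreCase("User-agent")) {
            // Assumption: each User-agent line resets the active flag.
            this.active = rest.equals("*")
                || ((this.userAgent != null) && rest.equalsIgnoreCase(this.userAgent));
            return;
        }

        if (this.active && command.equalsIgnoreCase("disallow") && rest.length() > 0) {
            try {
                // Resolve the path against the robots.txt URL, keep only the file part.
                URL url = new URL(this.robotURL, rest);
                this.exclude.add(url.getFile());
            } catch (MalformedURLException e) {
                // Assumption: a malformed Disallow value is simply skipped.
            }
        }
    }

    List<String> getExclude() {
        return this.exclude;
    }
}
```

Feeding the lines of a robots.txt file to parseLine one at a time leaves the disallowed paths for the spider's user agent in the exclude list. An empty Disallow value adds nothing, which mirrors the length check in the fragment above.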
Determining if a URL is to be Excluded
To determine if a URL should be excluded, the isExcluded method is called. If the
URL should be excluded, this method will return a value of true. This method begins by
looping through all entries in the exclude list. If the file portion of the specified URL
starts with an entry in the exclude list, a value of true is returned.
for (String str : this.exclude) {
  if (url.getFile().startsWith(str)) {
    return true;
  }
}
Return a value of false if the URL was not found.
return false;
Prior to adding any URL to the workload, the spider will always call this method.
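To see the prefix matching in isolation, the check can be exercised with a small standalone class. This is a minimal sketch, not the book's listing: the class name ExcludeListSketch, the add method shown here, and the String overload of isExcluded are hypothetical. Only the loop itself is taken from the fragment above.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Hypothetical holder for the exclude list; only the loop mirrors the text.
class ExcludeListSketch {
    private final List<String> exclude = new ArrayList<>();

    // Register a disallowed path prefix, for example "/cgi-bin/".
    void add(String path) {
        if (!this.exclude.contains(path)) {
            this.exclude.add(path);
        }
    }

    // Returns true if the URL's file portion starts with any excluded prefix.
    boolean isExcluded(URL url) {
        for (String str : this.exclude) {
            if (url.getFile().startsWith(str)) {
                return true;
            }
        }
        return false;
    }

    // Convenience overload (an assumption) that accepts the URL as text.
    boolean isExcluded(String spec) {
        try {
            return isExcluded(new URL(spec));
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Bad URL: " + spec, e);
        }
    }
}
```

Because the comparison uses startsWith, an entry of /private excludes /private/data.html and also /private-notes.html. URL.getFile() includes the query string, so /cgi-bin/search?q=1 is still matched by the prefix /cgi-bin/.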