If an asterisk was not specified, then the user agent command must match the user agent for the spider. If they do match, the program begins actively checking for Disallow commands.
if ((this.userAgent != null) && rest.equalsIgnoreCase(this.userAgent)) {
    this.active = true;
}
If we are currently active, then we need to check for additional commands.
if (this.active) {
Next, we check to see if this is a Disallow command. If it is a Disallow command, then we create a new URL from the information provided.
if (command.equalsIgnoreCase("disallow")) {
    if (rest.trim().length() > 0) {
        URL url = new URL(this.robotURL, rest);
        add(url.getFile());
    }
}
At this point, the line is now parsed, and we are ready for the next line.
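The line-handling steps described above can be pulled together into one small class. The following is only a minimal sketch, not the book's listing: the class name RobotsLineSketch, the method names, and the way the line is split at the first colon are assumptions made for illustration. It also assumes that every User-agent line resets the active flag, and it uses java.net.URI instead of java.net.URL so that no checked exception is involved, which means only the path (without any query string) is stored.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; all names here are illustrative assumptions.
class RobotsLineSketch {
    private final URI robotsUri;     // location of robots.txt, used to resolve paths
    private final String userAgent;  // the spider's own user agent name
    private final List<String> exclude = new ArrayList<>();
    private boolean active = false;  // true while a matching User-agent block is open

    RobotsLineSketch(String robotsUrl, String userAgent) {
        this.robotsUri = URI.create(robotsUrl);
        this.userAgent = userAgent;
    }

    // Parse one line of robots.txt and update the active flag or the exclude list.
    void loadLine(String line) {
        String str = line.trim();
        int colon = str.indexOf(':');
        // Skip blank lines, comments, and lines without a "command: value" shape.
        if (str.isEmpty() || str.charAt(0) == '#' || colon == -1) {
            return;
        }
        String command = str.substring(0, colon).trim();
        String rest = str.substring(colon + 1).trim();

        if (command.equalsIgnoreCase("User-agent")) {
            // Assumption: each User-agent line resets the state.
            // An asterisk applies to every spider; otherwise the names must match.
            this.active = rest.equals("*")
                    || (this.userAgent != null && rest.equalsIgnoreCase(this.userAgent));
        } else if (this.active && command.equalsIgnoreCase("Disallow") && !rest.isEmpty()) {
            // Resolve the value against the robots.txt location and keep only the path.
            this.exclude.add(this.robotsUri.resolve(rest).getRawPath());
        }
    }

    List<String> getExclude() {
        return this.exclude;
    }
}
```

Feeding the sketch a block for another spider followed by a block for its own name records only the Disallow paths from the second block. An empty Disallow value is skipped, mirroring the length check in the fragment above.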
Determining if a URL is to be Excluded
To determine if a URL should be excluded, the isExcluded function is called. If the URL should be excluded, this method will return a value of true. This method begins by looping through all entries in the exclude list. If the file portion of the specified URL starts with an entry in the exclude list, a value of true is returned.
for (String str : this.exclude) {
    if (url.getFile().startsWith(str)) {
        return true;
    }
}
Return a value of false if the URL was not found.
return false;
Prior to adding any URL to the workload, the spider will always call this method.
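To make the prefix check concrete, here is a minimal, self-contained sketch of the exclusion test. The class name ExcludeCheckSketch and its add method are hypothetical, and the sketch takes a java.net.URI and compares its path, whereas the fragment above calls getFile() on a URL. Treat it as an illustration of the technique rather than the spider's actual code.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names are illustrative assumptions.
class ExcludeCheckSketch {
    private final List<String> exclude = new ArrayList<>();

    // Register a path prefix taken from a Disallow line.
    void add(String pathPrefix) {
        this.exclude.add(pathPrefix);
    }

    // Returns true if the path of the URL starts with any excluded prefix.
    boolean isExcluded(URI url) {
        String path = url.getRawPath();
        if (path == null) {
            // Opaque URIs such as mailto: have no path, so nothing can match.
            return false;
        }
        for (String prefix : this.exclude) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }
}
```

Because the comparison is a plain prefix match, an entry of /private excludes /private/data.html and also /private-notes.html. A spider that wants directory-only matching would need entries that end in a slash.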