            {
                this.active = true;
            } else
            {
                if ((this.userAgent != null) &&
                    rest.equalsIgnoreCase(this.userAgent))
                {
                    this.active = true;
                }
            }
        }
        if (this.active)
        {
            if (command.equalsIgnoreCase("disallow"))
            {
                if (rest.trim().length() > 0)
                {
                    URL url = new URL(this.robotURL, rest);
                    add(url.getFile());
                }
            }
        }
    }
}
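To make the Disallow branch above more concrete, the following sketch isolates its two statements: the text after the colon is resolved against the robots.txt URL, and only the file portion of the result is kept. The class DisallowPathSketch and the helper disallowedFile are hypothetical names used for illustration only, and the example host is an assumption.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class DisallowPathSketch {
    // Hypothetical helper: mirrors "new URL(this.robotURL, rest)" followed by
    // "url.getFile()" from the listing, using plain strings for convenience.
    static String disallowedFile(String robotsUrl, String rest) {
        try {
            URL base = new URL(robotsUrl);
            // Resolve the Disallow value relative to the robots.txt URL.
            URL url = new URL(base, rest);
            // Keep only the path (and query, if any), not the host.
            return url.getFile();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        String robots = "http://www.example.com/robots.txt";
        System.out.println(disallowedFile(robots, "/private/")); // prints /private/
        System.out.println(disallowedFile(robots, "cgi-bin/"));  // prints /cgi-bin/
    }
}
```

Because the value is resolved against the robots.txt URL, a relative entry such as cgi-bin/ ends up rooted at the top of the host in this sketch.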
The RobotsFilter class defines four instance variables, which are listed here:
• robotURL
• exclude
• active
• userAgent
The robotURL variable holds the URL of the robots.txt file that was most recently received. Each time the newHost method is called, a new robotURL value is constructed by appending the string “robots.txt” to the host name.
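As a rough illustration of how such a URL might be built, the sketch below derives the robots.txt location from any page URL on the same host. The class RobotUrlSketch and the helper robotsUrlFor are hypothetical and are not part of RobotsFilter; the sketch also assumes the file sits at the root of the host.

```java
import java.net.MalformedURLException;
import java.net.URL;

final class RobotUrlSketch {
    // Hypothetical helper: returns the robots.txt URL for the host of a page.
    static String robotsUrlFor(String pageUrl) {
        try {
            URL page = new URL(pageUrl);
            // getPort() returns -1 when the URL has no explicit port; the
            // four-argument constructor treats -1 as "use the default port".
            URL robots = new URL(page.getProtocol(), page.getHost(),
                                 page.getPort(), "/robots.txt");
            return robots.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // prints http://www.example.com/robots.txt
        System.out.println(robotsUrlFor("http://www.example.com/docs/index.html"));
    }
}
```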
The exclude variable contains a list of the URLs that are to be excluded. This list is built each time a new host is encountered. The exclude list must be cleared for each new host.
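The life cycle of such a list can be sketched as follows. This is a minimal illustration, not the RobotsFilter code itself: the class ExcludeListSketch is hypothetical, and the prefix comparison used in isExcluded is an assumption about how an excluded path might be matched.

```java
import java.util.ArrayList;
import java.util.List;

final class ExcludeListSketch {
    // Paths collected from Disallow lines for the current host.
    private final List<String> exclude = new ArrayList<>();

    // Called when the spider moves to a new host: old entries no longer apply.
    void newHost() {
        exclude.clear();
    }

    // Records a disallowed path, skipping duplicates.
    void add(String path) {
        if (!exclude.contains(path)) {
            exclude.add(path);
        }
    }

    // Assumption: a path is excluded when it starts with any recorded entry.
    boolean isExcluded(String path) {
        for (String prefix : exclude) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        ExcludeListSketch list = new ExcludeListSketch();
        list.add("/private/");
        System.out.println(list.isExcluded("/private/report.html")); // prints true
        list.newHost();
        System.out.println(list.isExcluded("/private/report.html")); // prints false
    }
}
```

Clearing the list in newHost keeps the rules of one host from leaking into the next.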
The active variable records whether the loader is currently collecting Disallow lines. The loader becomes active when a User-agent line matches the user agent string being used by the spider.
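The effect of this flag can be seen in a small stand-alone sketch that walks through the lines of a robots.txt file. The class ActiveFlagSketch and the method disallowedFor are hypothetical. The sketch also makes two assumptions that the listing excerpt does not confirm: that a User-agent value of * activates the loader for every spider, and that each User-agent line resets the flag before it is tested.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ActiveFlagSketch {
    // Hypothetical helper: returns the Disallow values that apply to userAgent.
    static List<String> disallowedFor(String userAgent, List<String> lines) {
        List<String> result = new ArrayList<>();
        boolean active = false;
        for (String raw : lines) {
            String line = raw.trim();
            int i = line.indexOf(':');
            // Skip blank lines, comments, and lines without a command.
            if (line.isEmpty() || line.charAt(0) == '#' || i == -1) {
                continue;
            }
            String command = line.substring(0, i).trim();
            String rest = line.substring(i + 1).trim();
            if (command.equalsIgnoreCase("User-agent")) {
                // Assumption: "*" matches every spider; otherwise compare names.
                active = rest.equals("*")
                        || (userAgent != null && rest.equalsIgnoreCase(userAgent));
            } else if (active && command.equalsIgnoreCase("Disallow")
                    && !rest.isEmpty()) {
                result.add(rest);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "User-agent: OtherBot",
                "Disallow: /a/",
                "User-agent: MySpider",
                "Disallow: /b/",
                "Disallow:");
        System.out.println(disallowedFor("myspider", lines)); // prints [/b/]
    }
}
```

Only the Disallow lines that follow a matching User-agent line are collected; the /a/ entry belongs to another spider and is ignored.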
The userAgent variable holds the user agent string that the spider is using. This variable is passed into the newHost method.