There are also several methods that make up the RobotsFilter class. These will be discussed in the next sections.
Processing a New Host
When a new host is about to be processed, the spider calls the newHost method of any filters that are in use. When the newHost method is called for the RobotsFilter class, the host is scanned for a robots.txt file. If one is found, it is processed.
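The exact interface through which the spider invokes its filters is not shown here; the following is a minimal sketch of what such a dispatch might look like, assuming a SpiderFilter interface with a newHost method (both names are illustrative, not the book's actual API).

import java.io.IOException;
import java.util.List;

// Hypothetical filter interface, for illustration only.
interface SpiderFilter {
  void newHost(String host, String userAgent) throws IOException;
}

class HostNotifier {
  // Tell every filter in use that a new host is about to be processed.
  static void notifyFilters(List<SpiderFilter> filters,
      String host, String userAgent) throws IOException {
    for (SpiderFilter filter : filters) {
      filter.newHost(host, userAgent);
    }
  }
}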
The first action that the newHost method performs is to declare a String that holds lines read in from the robots.txt file. Then it sets the active variable to false.
String str;
this.active = false;
this.userAgent = userAgent;
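The fragments in this section reference several instance fields whose declarations are not shown. A plausible sketch of those declarations, assuming exclude is a list of disallowed path strings:

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Assumed declarations for the fields used in this section's fragments;
// the actual RobotsFilter class may declare them differently.
class RobotsFilterFieldsSketch {
  private boolean active;       // are the current rules aimed at our agent?
  private String userAgent;     // the user agent we report; may be null
  private URL robotURL;         // location of the host's robots.txt file
  private final List<String> exclude = new ArrayList<String>();
}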
Next, a connection is opened to the robots.txt file. The robots.txt file is always located at the root of the host.
this.robotURL = new URL("http", host, 80, "/robots.txt");
URLConnection http = this.robotURL.openConnection();
If a user agent was specified using the userAgent variable, then the user agent is set for the connection. If the userAgent variable is null, the default Java user agent (typically a string such as Java/1.8.0_201, depending on the JVM version) will be used.
if (userAgent != null) {
  http.setRequestProperty("User-Agent", userAgent);
}
Next, a BufferedReader is set up to read the robots.txt file on a line-by-line basis.
InputStream is = http.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader r = new BufferedReader(isr);
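Note that many hosts have no robots.txt file at all; in that case the call to getInputStream typically fails with a FileNotFoundException for the 404 response. A minimal sketch of guarding against this (the method shape is an assumption, not the book's code):

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

class RobotsFetchSketch {
  // Try to open robots.txt; a missing file surfaces as FileNotFoundException.
  static BufferedReader openRobots(String host) throws IOException {
    URL robotURL = new URL("http", host, 80, "/robots.txt");
    URLConnection http = robotURL.openConnection();
    try {
      return new BufferedReader(
          new InputStreamReader(http.getInputStream()));
    } catch (FileNotFoundException e) {
      return null; // no robots.txt on this host; nothing is disallowed
    }
  }
}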
We are now about to begin parsing the file, so it is important to clear any previous list of
excluded URLs.
this.exclude.clear();
A while loop is used to read each line from the robots.txt file. Each line read is passed on to the loadLine method. The loadLine method actually interprets the command given on each line of the robots.txt file.
try {
  while ((str = r.readLine()) != null) {
    loadLine(str);
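The body of loadLine is not shown in this section. A simplified sketch of what such a parser might do, assuming the fields sketched earlier (this is illustrative, not the book's actual implementation):

import java.util.ArrayList;
import java.util.List;

// Simplified robots.txt line parser, for illustration only.
class RobotsLineSketch {
  private boolean active = false; // inside a User-agent block that matches us?
  private String userAgent;       // our agent string; may be null
  private final List<String> exclude = new ArrayList<String>();

  void loadLine(String str) {
    str = str.trim();
    int i = str.indexOf(':');
    // Skip comments and lines that contain no command.
    if (str.startsWith("#") || i == -1) {
      return;
    }
    String command = str.substring(0, i).trim();
    String rest = str.substring(i + 1).trim();
    if (command.equalsIgnoreCase("User-agent")) {
      // Rules that follow apply to us if the agent is "*" or our own.
      this.active = rest.equals("*")
          || (this.userAgent != null
              && this.userAgent.equalsIgnoreCase(rest));
    } else if (command.equalsIgnoreCase("Disallow") && this.active) {
      if (rest.length() > 0) {
        this.exclude.add(rest); // remember the disallowed path prefix
      }
    }
  }
}

Once the file has been parsed, the exclude list can be consulted whenever the spider considers a URL; a hypothetical check (the name isExcluded is an assumption) might simply test whether the URL's path begins with any disallowed prefix, for example url.getFile().startsWith(path).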