There are also several methods that make up the RobotsFilter class. These will be discussed in the next sections.
Processing a New Host
When a new host is about to be processed, the spider calls the newHost method of
any filters that are in use. When the newHost method is called for the RobotsFilter
class, the host is scanned for a robots.txt file. If one is found, it is processed.
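To put this in context, the lines below are a minimal sketch of how a spider might notify the filter about a new host before crawling it. The no-argument constructor, the IOException handling, and the example host and user agent strings are assumptions made for illustration; only the newHost method itself is described in this section.

// Hypothetical usage sketch: tell the filter that a new host is about to be
// crawled so it can fetch and parse that host's robots.txt file.
// The constructor and exception handling shown here are assumptions.
RobotsFilter filter = new RobotsFilter();
try {
  filter.newHost("www.example.com", "MySpider/1.0");
} catch (IOException e) {
  // If robots.txt cannot be retrieved, the spider must decide how to proceed.
}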
The first action the newHost method performs is to declare a String that holds lines read from the robots.txt file. It then sets the active variable to false and stores the user agent that was passed in.
String str;
this.active = false;
this.userAgent = userAgent;
Next, a connection is opened to the robots.txt file, which is always located at the root of the host. For example, for the host www.example.com the file would be found at http://www.example.com/robots.txt.
this.robotURL = new URL("http", host, 80, "/robots.txt");
URLConnection http = this.robotURL.openConnection();
If a user agent was specified in the userAgent parameter, it is set on the connection. If userAgent is null, the default Java user agent will be used.
if (userAgent != null) {
http.setRequestProperty("User-Agent", userAgent);
}
Next, a BufferedReader is set up to read the robots.txt file on a line-by-line basis.
InputStream is = http.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader r = new BufferedReader(isr);
We are now about to begin parsing the file, so it is important to clear any previous list of
excluded URLs.
this.exclude.clear();
A while loop is used to read each line from the robots.txt file. Each line read is passed to the loadLine method, which interprets the command given on that line of the robots.txt file.
try {
  while ((str = r.readLine()) != null) {
    loadLine(str); // interpret each robots.txt line
  }
} finally {
  r.close(); // release the reader when reading completes or fails
}
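The loadLine method itself does not appear in this part of the listing. As a rough illustration of the kind of parsing it performs, the sketch below recognizes User-agent and Disallow directives and records disallowed path prefixes in the exclude list. The applies field, the method signature, and the simplified user-agent matching are assumptions made for this sketch, not the actual implementation.

// Illustrative sketch only: one way a loadLine-style parser could handle a
// single robots.txt line. Field names and matching rules are assumptions.
private void loadLine(String line) {
  String str = line.trim();
  // Strip comments, which begin with '#'.
  int comment = str.indexOf('#');
  if (comment != -1) {
    str = str.substring(0, comment).trim();
  }
  if (str.length() == 0) {
    return;
  }
  // Each directive has the form "command: value".
  int colon = str.indexOf(':');
  if (colon == -1) {
    return;
  }
  String command = str.substring(0, colon).trim().toLowerCase();
  String value = str.substring(colon + 1).trim();
  if (command.equals("user-agent")) {
    // Apply the following Disallow lines if the record targets all agents
    // or this spider's own user agent (simplified matching).
    this.applies = value.equals("*")
        || (this.userAgent != null
            && this.userAgent.toLowerCase().contains(value.toLowerCase()));
  } else if (command.equals("disallow") && this.applies && value.length() > 0) {
    // Remember the excluded path prefix; an empty Disallow value means
    // nothing is excluded, so it is skipped.
    this.exclude.add(value);
  }
}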