There are also several methods that make up the RobotsFilter class. These will be discussed in the next sections.
Processing a New Host
When a new host is about to be processed, the spider calls the newHost method of any filters that are in use. When the newHost method is called for the RobotsFilter class, the host is scanned for a robots.txt file. If one is found, it is processed.
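The exact interface through which the spider invokes its filters is not shown here; the following is a minimal sketch of what such a dispatch might look like, assuming a SpiderFilter interface with a newHost method (both names are illustrative, not the book's actual API).

import java.io.IOException;
import java.util.List;

// Hypothetical filter interface, for illustration only.
interface SpiderFilter {
  void newHost(String host, String userAgent) throws IOException;
}

class HostNotifier {
  // Tell every filter in use that a new host is about to be processed.
  static void notifyFilters(List<SpiderFilter> filters,
      String host, String userAgent) throws IOException {
    for (SpiderFilter filter : filters) {
      filter.newHost(host, userAgent);
    }
  }
}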
The first action that the newHost method performs is to declare a String that holds lines read in from the robots.txt file. Then it sets the active variable to false.
String str;
this.active = false;
this.userAgent = userAgent;
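The fragments in this section reference several instance fields whose declarations are not shown. A plausible sketch of those declarations, assuming exclude is a list of disallowed path strings:

import java.net.URL;
import java.util.ArrayList;
import java.util.List;

// Assumed declarations for the fields used in this section's fragments;
// the actual RobotsFilter class may declare them differently.
class RobotsFilterFieldsSketch {
  private boolean active;       // are the current rules aimed at our agent?
  private String userAgent;     // the user agent we report; may be null
  private URL robotURL;         // location of the host's robots.txt file
  private final List<String> exclude = new ArrayList<String>();
}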
Next, a connection is opened to the robots.txt file. The robots.txt file is always located at the root of the host.
this.robotURL = new URL("http", host, 80, "/robots.txt");
URLConnection http = this.robotURL.openConnection();
If a user agent was specified using the userAgent variable, then the user agent is set for the connection. If the userAgent variable is null, the default Java user agent (typically a string such as Java/1.8.0_201, depending on the JVM version) will be used.
if (userAgent != null) {
  http.setRequestProperty("User-Agent", userAgent);
}
Next, a BufferedReader is set up to read the robots.txt file on a line-by-line basis.
InputStream is = http.getInputStream();
InputStreamReader isr = new InputStreamReader(is);
BufferedReader r = new BufferedReader(isr);
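Note that many hosts have no robots.txt file at all; in that case the call to getInputStream typically fails with a FileNotFoundException for the 404 response. A minimal sketch of guarding against this (the method shape is an assumption, not the book's code):

import java.io.BufferedReader;
import java.io.FileNotFoundException;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;

class RobotsFetchSketch {
  // Try to open robots.txt; a missing file surfaces as FileNotFoundException.
  static BufferedReader openRobots(String host) throws IOException {
    URL robotURL = new URL("http", host, 80, "/robots.txt");
    URLConnection http = robotURL.openConnection();
    try {
      return new BufferedReader(
          new InputStreamReader(http.getInputStream()));
    } catch (FileNotFoundException e) {
      return null; // no robots.txt on this host; nothing is disallowed
    }
  }
}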
We are now about to begin parsing the file, so it is important to clear any previous list of
excluded URLs.
this.exclude.clear();
A while loop is used to read each line from the robots.txt file. Each line read is passed on to the loadLine method. The loadLine method actually interprets the command given on each line of the robots.txt file.
try {
  while ((str = r.readLine()) != null) {
    loadLine(str);
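The body of loadLine is not shown in this section. A simplified sketch of what such a parser might do, assuming the fields sketched earlier (this is illustrative, not the book's actual implementation):

import java.util.ArrayList;
import java.util.List;

// Simplified robots.txt line parser, for illustration only.
class RobotsLineSketch {
  private boolean active = false; // inside a User-agent block that matches us?
  private String userAgent;       // our agent string; may be null
  private final List<String> exclude = new ArrayList<String>();

  void loadLine(String str) {
    str = str.trim();
    int i = str.indexOf(':');
    // Skip comments and lines that contain no command.
    if (str.startsWith("#") || i == -1) {
      return;
    }
    String command = str.substring(0, i).trim();
    String rest = str.substring(i + 1).trim();
    if (command.equalsIgnoreCase("User-agent")) {
      // Rules that follow apply to us if the agent is "*" or our own.
      this.active = rest.equals("*")
          || (this.userAgent != null
              && this.userAgent.equalsIgnoreCase(rest));
    } else if (command.equalsIgnoreCase("Disallow") && this.active) {
      if (rest.length() > 0) {
        this.exclude.add(rest); // remember the disallowed path prefix
      }
    }
  }
}

Once the file has been parsed, the exclude list can be consulted whenever the spider considers a URL; a hypothetical check (the name isExcluded is an assumption) might simply test whether the URL's path begins with any disallowed prefix, for example url.getFile().startsWith(path).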