* The user agent being used by the spider. Leave
* null for default.
* @throws IOException
* Thrown if an I/O error occurs.
*/
public void newHost(String host, String userAgent) throws
IOException;
}
The first required method is named isExcluded. The isExcluded method is called for each URL that the spider finds for the current host. If the URL is to be excluded, isExcluded returns a value of true.
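As an illustration of this contract, a minimal filter might exclude image URLs by extension. This is a sketch, not code from the book: only newHost appears in the excerpt above, so the boolean isExcluded(URL) signature used here is an assumption.

```java
import java.net.URL;

// Hypothetical filter sketch: excludes common image URLs.
// The signature boolean isExcluded(URL) is assumed; the
// excerpt above only shows the newHost declaration.
class ImageExcludingFilter {
    // Returns true if the spider should skip this URL.
    public boolean isExcluded(URL url) {
        String path = url.getPath().toLowerCase();
        return path.endsWith(".jpg")
            || path.endsWith(".gif")
            || path.endsWith(".png");
    }
}
```

The spider would call this once for every URL discovered on the current host, dropping any URL for which it returns true.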
The second required method is named newHost. The newHost method is called whenever the spider begins processing a new host. As you will recall from Chapter 14, the spider processes one host at a time; as a result, newHost is called each time a new host is about to be processed, including the first one. After newHost is called, isExcluded is called for each URL found at that host. Hosts are never overlapped: once newHost has been called, you will not receive URLs from the previous host.
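The calling contract described above can be sketched as a simple driver loop. This is a hypothetical illustration of the order in which a spider would invoke the two methods, not the actual Heaton Research Spider code; the SpiderFilter interface below is a local stand-in re-declaring the two methods the text describes.

```java
import java.io.IOException;
import java.net.URL;
import java.util.*;

// Local stand-in for the SpiderFilter contract described in the text.
interface SpiderFilter {
    void newHost(String host, String userAgent) throws IOException;
    boolean isExcluded(URL url);
}

// Hypothetical driver showing the calling order: newHost is called
// once per host (including the first), then isExcluded for each URL
// found at that host. Hosts are never interleaved.
class FilterDriver {
    static List<URL> crawl(SpiderFilter filter,
                           Map<String, List<URL>> urlsByHost)
            throws IOException {
        List<URL> accepted = new ArrayList<>();
        for (Map.Entry<String, List<URL>> host : urlsByHost.entrySet()) {
            filter.newHost(host.getKey(), null); // null = default user agent
            for (URL url : host.getValue()) {
                if (!filter.isExcluded(url)) {
                    accepted.add(url);
                }
            }
        }
        return accepted;
    }
}
```

Because each host's URLs are fully drained before the next call to newHost, a filter can safely reset per-host state (such as a parsed robots.txt) inside newHost.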
Implementing a robots.txt Filter
The RobotsFilter class provided by the Heaton Research Spider implements the SpiderFilter interface. This section shows how the RobotsFilter class was implemented. The RobotsFilter class is shown in Listing 16.3.
Listing 16.3: A Robots.txt Filter (RobotsFilter.java)
package com.heatonresearch.httprecipes.spider.filter;

import java.io.*;
import java.net.*;
import java.util.*;

public class RobotsFilter implements SpiderFilter
{
  /**
   * The full URL of the robots.txt file.
   */
  private URL robotURL;

  /**
   * A list of URLs to exclude.
   */
  private List<String> exclude = new ArrayList<String>();