   *          The user agent being used by the spider. Leave
   *          null for default.
   * @throws IOException
   *           Thrown if an I/O error occurs.
   */
  public void newHost(String host, String userAgent)
      throws IOException;
}
The first required method is named isExcluded. The isExcluded method is called
for each URL that the spider finds on the current host. If the URL should be excluded,
isExcluded returns true.
The second required method is named newHost. The newHost method is called
whenever the spider begins processing a new host. As you will recall from Chapter 14, the
spider processes one host at a time. As a result, newHost is called each time the spider
moves to a new host, including the first. After newHost is called, isExcluded will be
called for each URL found at that host. Hosts do not overlap: once newHost is called,
you will not receive URLs from the previous host.
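To see the contract in miniature, consider the following sketch of a SpiderFilter
implementation. It is illustrative only, not part of the Heaton Research Spider: the class
name HostLoggingFilter is invented, and the isExcluded signature (candidate URL
in, boolean out) is assumed from the description above.

package com.heatonresearch.httprecipes.spider.filter;

import java.io.IOException;
import java.net.URL;

// Illustrative only: a do-nothing filter that logs each new host.
public class HostLoggingFilter implements SpiderFilter
{
  /**
   * Called once for every URL found on the current host.
   * Return true to exclude the URL from the spider.
   */
  public boolean isExcluded(URL url)
  {
    // A real filter would examine the URL here.
    return false;
  }

  /**
   * Called each time the spider moves to a new host, including
   * the first, before any isExcluded calls for that host.
   */
  public void newHost(String host, String userAgent)
      throws IOException
  {
    System.out.println("Now spidering host: " + host);
  }
}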
Implementing a robots.txt Filter
The RobotsFilter class, provided by the Heaton Research Spider, implements the
SpiderFilter interface. This section will show how the RobotsFilter class is
implemented. The RobotsFilter class is shown in Listing 16.3.
Listing 16.3: A Robots.txt Filter (RobotsFilter.java)
package com.heatonresearch.httprecipes.spider.filter;

import java.io.*;
import java.net.*;
import java.util.*;

public class RobotsFilter implements SpiderFilter
{
  /**
   * The full URL of the robots.txt file.
   */
  private URL robotURL;

  /**
   * A list of URLs to exclude.
   */
  private List<String> exclude = new ArrayList<String>();
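These two fields suggest the filter's approach: when newHost is called, the filter
downloads the host's robots.txt file and records each disallowed path in the exclude
list, which isExcluded can then consult. As a rough illustration of the parsing step
only, here is a simplified sketch; the method name loadDisallowLines and its details
are assumptions for illustration, not the book's actual code.

  // Illustrative sketch only: read the host's robots.txt and
  // collect the paths named by "Disallow:" lines.
  private void loadDisallowLines(URL robotURL, List<String> exclude)
      throws IOException
  {
    BufferedReader reader = new BufferedReader(
        new InputStreamReader(robotURL.openStream()));
    try
    {
      String line;
      while ((line = reader.readLine()) != null)
      {
        line = line.trim();
        // A full filter must also honor "User-agent:" sections;
        // this sketch records every disallowed path it sees.
        if (line.toLowerCase().startsWith("disallow:"))
        {
          String path = line.substring("disallow:".length()).trim();
          if (path.length() > 0 && !exclude.contains(path))
          {
            exclude.add(path);
          }
        }
      }
    }
    finally
    {
      reader.close();
    }
  }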