* The user agent being used by the spider. Leave
* null for default.
* @throws IOException
* Thrown if an I/O error occurs.
*/
public void newHost(String host, String userAgent) throws
IOException;
}
The first required method is named isExcluded. The isExcluded method is called for each URL that the spider finds for the current host. If the URL is to be excluded, isExcluded returns a value of true.
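As an illustration of this contract, a minimal filter might exclude image URLs by extension. This is a sketch, not code from the book: only newHost appears in the excerpt above, so the boolean isExcluded(URL) signature used here is an assumption.

```java
import java.net.URL;

// Hypothetical filter sketch: excludes common image URLs.
// The signature boolean isExcluded(URL) is assumed; the
// excerpt above only shows the newHost declaration.
class ImageExcludingFilter {
    // Returns true if the spider should skip this URL.
    public boolean isExcluded(URL url) {
        String path = url.getPath().toLowerCase();
        return path.endsWith(".jpg")
            || path.endsWith(".gif")
            || path.endsWith(".png");
    }
}
```

The spider would call this once for every URL discovered on the current host, dropping any URL for which it returns true.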
The second required method is named newHost. The newHost method is called whenever the spider begins processing a new host. As you will recall from Chapter 14, the spider processes one host at a time; as a result, newHost is called each time a new host is about to be processed, including the first one. After newHost is called, isExcluded is called for each URL found at that host. Hosts are never overlapped: once newHost has been called, you will not receive URLs from the previous host.
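The calling contract described above can be sketched as a simple driver loop. This is a hypothetical illustration of the order in which a spider would invoke the two methods, not the actual Heaton Research Spider code; the SpiderFilter interface below is a local stand-in re-declaring the two methods the text describes.

```java
import java.io.IOException;
import java.net.URL;
import java.util.*;

// Local stand-in for the SpiderFilter contract described in the text.
interface SpiderFilter {
    void newHost(String host, String userAgent) throws IOException;
    boolean isExcluded(URL url);
}

// Hypothetical driver showing the calling order: newHost is called
// once per host (including the first), then isExcluded for each URL
// found at that host. Hosts are never interleaved.
class FilterDriver {
    static List<URL> crawl(SpiderFilter filter,
                           Map<String, List<URL>> urlsByHost)
            throws IOException {
        List<URL> accepted = new ArrayList<>();
        for (Map.Entry<String, List<URL>> host : urlsByHost.entrySet()) {
            filter.newHost(host.getKey(), null); // null = default user agent
            for (URL url : host.getValue()) {
                if (!filter.isExcluded(url)) {
                    accepted.add(url);
                }
            }
        }
        return accepted;
    }
}
```

Because each host's URLs are fully drained before the next call to newHost, a filter can safely reset per-host state (such as a parsed robots.txt) inside newHost.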
Implementing a robots.txt Filter
The RobotsFilter class provided by the Heaton Research Spider implements the SpiderFilter interface. This section shows how the RobotsFilter class was implemented. The RobotsFilter class is shown in Listing 16.3.
Listing 16.3: A Robots.txt Filter (RobotsFilter.java)
package com.heatonresearch.httprecipes.spider.filter;

import java.io.*;
import java.net.*;
import java.util.*;

public class RobotsFilter implements SpiderFilter
{
  /**
   * The full URL of the robots.txt file.
   */
  private URL robotURL;

  /**
   * A list of URLs to exclude.
   */
  private List<String> exclude = new ArrayList<String>();