Java Reference
In-Depth Information
keepAliveTime: 60
dbURL: jdbc:mysql://192.168.1.20/spider?user=root
dbClass: com.mysql.jdbc.Driver
workloadManager: com.heatonresearch.httprecipes.spider.workload.sql.SQLWorkloadManager
startup: clear
filter: com.heatonresearch.httprecipes.spider.filter.RobotsFilter
As you can see from the above listing, the filter line specifies the filter. If you would like to use more than one filter, simply include more than one filter line.
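For instance, a configuration that applies two filters might contain lines like the following. The second filter class, URLPatternFilter, is a hypothetical name used only for illustration; it is not part of the package.

```
filter: com.heatonresearch.httprecipes.spider.filter.RobotsFilter
filter: com.heatonresearch.httprecipes.spider.filter.URLPatternFilter
```

The spider applies every filter line listed; a URL is skipped if any one of the filters excludes it.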
You should always make use of the RobotsFilter. If a webmaster asks spiders to skip parts of a site, or the entire site, through the robots.txt file, you should honor this request. Including the RobotsFilter honors this file automatically.
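For reference, a robots.txt file that asks all spiders to stay out of part of a site looks like the following. With the RobotsFilter configured, any URL under /private/ on that host would be excluded automatically.

```
User-agent: *
Disallow: /private/
```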
The Filter Interface
To create a filter class for the Heaton Research spider, that class must implement the SpiderFilter interface. This interface defines the two functions that must be implemented to create a filter. The SpiderFilter interface is shown in Listing 16.2.
Listing 16.2: The SpiderFilter Interface (SpiderFilter.java)
package com.heatonresearch.httprecipes.spider.filter;
import java.io.*;
import java.net.*;
public interface SpiderFilter {
/**
* Check to see if the specified URL is to be excluded.
*
* @param url
* The URL to be checked.
* @return Returns true if the URL should be excluded.
*/
public boolean isExcluded(URL url);
/**
* Called when a new host is to be processed. Hosts
* are processed one at a time. SpiderFilter classes
* cannot be shared among hosts.
*
* @param host
* The new host.
* @param userAgent