keepAliveTime: 60
dbURL: jdbc:mysql://192.168.1.20/spider?user=root
dbClass: com.mysql.jdbc.Driver
workloadManager: com.heatonresearch.httprecipes.spider.workload.sql.SQLWorkloadManager
startup: clear
filter: com.heatonresearch.httprecipes.spider.filter.RobotsFilter
As you can see from the above listing, the filter line specifies which filter class the spider should use. If you would like to use more than one filter, simply include more than one filter line.
You should always make use of the RobotsFilter . If a webmaster uses the robots.txt file to ask spiders to skip parts of a site, or the entire site, you should honor that request. Including the RobotsFilter causes the spider to honor this file automatically.
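For example, to keep the robots.txt handling and add a filter of your own, you would simply repeat the filter line. The ImageFilter class and the com.example.spider package shown here are hypothetical and only illustrate the format:
filter: com.heatonresearch.httprecipes.spider.filter.RobotsFilter
filter: com.example.spider.ImageFilter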
The Filter Interface
To create a filter for the Heaton Research spider, your class must implement the SpiderFilter interface. This interface defines the two methods that every filter must implement. The SpiderFilter interface is shown in Listing 16.2.
Listing 16.2: The SpiderFilter Interface (SpiderFilter.java)
package com.heatonresearch.httprecipes.spider.filter;
import java.io.*;
import java.net.*;
public interface SpiderFilter {

  /**
   * Check to see if the specified URL is to be excluded.
   *
   * @param url
   *          The URL to be checked.
   * @return Returns true if the URL should be excluded.
   */
  public boolean isExcluded(URL url);

  /**
   * Called when a new host is to be processed. Hosts
   * are processed one at a time. SpiderFilter classes
   * can not be shared among hosts.
   *
   * @param host
   *          The new host.
   * @param userAgent
   *          The user agent being used by the spider.
   * @throws IOException
   *           Thrown if an I/O error occurs.
   */
  public void newHost(String host, String userAgent)
      throws IOException;
}
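To see what a custom filter might look like, here is a minimal sketch of the hypothetical ImageFilter mentioned above. It is not part of the recipes package; it simply excludes URLs that end in common image extensions and keeps no per-host state, so its newHost method does nothing.
package com.example.spider;
import java.io.*;
import java.net.*;
import com.heatonresearch.httprecipes.spider.filter.SpiderFilter;

/**
 * A hypothetical example filter, not part of the Heaton
 * Research spider, that excludes common image URLs.
 */
public class ImageFilter implements SpiderFilter {

  /**
   * Exclude any URL whose path ends with a common
   * image extension.
   */
  public boolean isExcluded(URL url) {
    String path = url.getPath().toLowerCase();
    return path.endsWith(".jpg") || path.endsWith(".gif")
        || path.endsWith(".png");
  }

  /**
   * This filter keeps no per-host state, so nothing
   * needs to be done when a new host is started.
   */
  public void newHost(String host, String userAgent)
      throws IOException {
  }
}
Once compiled and placed on the spider's classpath, such a class would be activated by a filter line that names it, as in the configuration example earlier in this section.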