To exclude all bots from your entire site, use the following robots.txt file.
User-agent: *
Disallow: /
To block four directories on your site, use a similar robots.txt file.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
If you would like to exclude a specific bot, named BadBot, from your /private/ directory, use the following robots.txt file.
User-agent: BadBot
Disallow: /private/
The pound “#” character can be used to insert comments, such as:
# This is a comment
User-agent: * # match all bots
Disallow: / # keep them out
The wildcard character (*) can only be used with the User-agent directive. The following line would be invalid:
Disallow: *
Rather, to disallow everything, simply disallow your document root, as follows:
Disallow: /
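Putting these rules together, the logic a compliant bot applies is straightforward: find the record whose User-agent line matches its own name (or the * record), collect that record's Disallow prefixes, and skip any URL whose path begins with one of them. The following is a minimal sketch of that logic in plain Java. It is not the Heaton Research implementation; the class and method names (SimpleRobotsCheck, disallowedPaths, isExcluded) are illustrative only, and it simplifies the format slightly by treating each User-agent line as starting a new record.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

public class SimpleRobotsCheck {

    // Collect the Disallow prefixes that apply to the named agent,
    // honoring '#' comments and the "*" wildcard user-agent.
    public static List<String> disallowedPaths(String robotsTxt, String agent)
            throws IOException {
        List<String> result = new ArrayList<>();
        boolean applies = false;
        BufferedReader in = new BufferedReader(new StringReader(robotsTxt));
        String line;
        while ((line = in.readLine()) != null) {
            int hash = line.indexOf('#');          // strip comments
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (line.isEmpty()) {
                continue;
            }
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                // The record applies to us if it names our agent or "*".
                applies = value.equals("*") || value.equalsIgnoreCase(agent);
            } else if (field.equals("disallow") && applies && !value.isEmpty()) {
                result.add(value);
            }
        }
        return result;
    }

    // A URL path is excluded if it starts with any disallowed prefix.
    public static boolean isExcluded(String path, List<String> disallowed) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) throws IOException {
        String robots = "User-agent: * # match all bots\n"
                + "Disallow: /private/ # keep them out\n";
        List<String> rules = disallowedPaths(robots, "MyBot");
        System.out.println(isExcluded("/private/data.html", rules)); // true
        System.out.println(isExcluded("/index.html", rules));        // false
    }
}

Note that Disallow values are matched as path prefixes, which is why disallowing the document root (/) excludes the entire site.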
In the next section, we will see how to use a filter with the Heaton Research Spider to
follow the robots.txt files posted on sites.
Using Filters with the Heaton Research Spider
The Heaton Research Spider allows you to provide one or more filter classes. These filter classes instruct the spider to skip certain URLs. You can specify a filter class as part of the spider configuration file. The spider comes with one built-in filter, named RobotsFilter. This filter scans a site's robots.txt file and instructs the spider to skip any URL that the file marks as Disallowed.
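Conceptually, a filter is just a yes-or-no test applied to each URL before the spider visits it. The sketch below shows that general shape; the interface and class names here (URLFilter, PrivateDirFilter) are hypothetical and are not the actual Heaton Research Spider API, which defines its own filter contract.

import java.net.URL;

// Hypothetical filter contract, for illustration only; the real
// Heaton Research Spider defines its own filter interface.
interface URLFilter {
    // Return true if the spider should skip this URL.
    boolean isExcluded(URL url);
}

// A toy filter that skips everything under /private/, in the same
// spirit as RobotsFilter after it has read a site's robots.txt.
class PrivateDirFilter implements URLFilter {
    @Override
    public boolean isExcluded(URL url) {
        return url.getPath().startsWith("/private/");
    }
}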
Listing 16.1 shows a spider configuration file that specifies the RobotsFilter .
Listing 16.1: Spider Configuration (Spider.conf)
timeout: 60000
maxDepth: -1
corePoolSize: 100
maximumPoolSize: 100
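The listing breaks off here in the source. For the spider to actually apply the RobotsFilter, the configuration file must also contain a filter entry giving the fully qualified class name of the RobotsFilter class; the exact name depends on the package layout of the Heaton Research Spider distribution.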