User-agent: *
Disallow: /
To block four directories on your site, use a similar robots.txt file.
User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /tmp/
Disallow: /private/
If you would like to exclude a specific bot, named BadBot, from your /private/ directory, use the following robots.txt file.
User-agent: BadBot
Disallow: /private/
The pound “#” character can be used to insert comments, such as:
# This is a comment
User-agent: * # match all bots
Disallow: / # keep them out
The wildcard character (*) can only be used with the User-agent directive. The following line would be invalid:
Disallow: *
Rather, to disallow everything, simply disallow your document root, as follows:
Disallow: /
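The prefix matching behind these Disallow rules can be sketched in plain Java, independent of any spider library. The class and method names below are illustrative only; a real parser would also need to honor per-bot User-agent groups and Allow rules.

```java
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of robots.txt Disallow matching (illustrative names).
public class RobotsDisallow {
    private final List<String> disallowed = new ArrayList<>();

    // Collect the Disallow path prefixes from a robots.txt body.
    public void parse(String robotsTxt) {
        for (String line : robotsTxt.split("\n")) {
            // Strip comments introduced by the pound character.
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash);
            }
            line = line.trim();
            if (line.toLowerCase().startsWith("disallow:")) {
                String path = line.substring("disallow:".length()).trim();
                // An empty Disallow value blocks nothing.
                if (!path.isEmpty()) {
                    disallowed.add(path);
                }
            }
        }
    }

    // A path is blocked if it begins with any Disallow prefix.
    public boolean isAllowed(String path) {
        for (String prefix : disallowed) {
            if (path.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        RobotsDisallow robots = new RobotsDisallow();
        robots.parse("User-agent: *\nDisallow: /cgi-bin/\nDisallow: /private/");
        System.out.println(robots.isAllowed("/private/data.html")); // false
        System.out.println(robots.isAllowed("/public/index.html")); // true
    }
}
```

Note that matching is a simple prefix test: Disallow: / blocks everything because every path begins with /.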
In the next section, we will see how to use a filter with the Heaton Research Spider to follow the robots.txt files posted on sites.
Using Filters with the Heaton Research Spider
The Heaton Research Spider allows you to provide one or more filter classes, which instruct the spider to skip certain URLs. You can specify a filter class as part of the spider configuration file. The spider comes with one built-in filter, named RobotsFilter. This filter scans a site's robots.txt file and instructs the spider to skip any URLs marked as disallowed in that file.
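Conceptually, a URL filter of this kind reduces to a single yes/no predicate per URL. The interface and class names below are a hypothetical sketch, not the spider's actual API:

```java
import java.util.Set;

// Hypothetical filter contract; the Heaton Research Spider's real
// filter interface may differ in names and signatures.
interface UrlFilter {
    // Return true if the spider should skip this URL.
    boolean isExcluded(String url);
}

// Toy implementation that skips URLs containing any blocked path.
public class PrefixFilter implements UrlFilter {
    private final Set<String> blockedPaths;

    public PrefixFilter(Set<String> blockedPaths) {
        this.blockedPaths = blockedPaths;
    }

    @Override
    public boolean isExcluded(String url) {
        for (String path : blockedPaths) {
            if (url.contains(path)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        UrlFilter filter = new PrefixFilter(Set.of("/private/", "/tmp/"));
        System.out.println(filter.isExcluded("http://example.com/private/a.html")); // true
        System.out.println(filter.isExcluded("http://example.com/index.html"));     // false
    }
}
```

A filter like RobotsFilter would populate its blocked list from a downloaded robots.txt file rather than from a hard-coded set, but the skip decision the spider consults is the same shape.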
Listing 16.1 shows a spider configuration file that specifies the RobotsFilter.
Listing 16.1: Spider Configuration (Spider.conf)
timeout: 60000
maxDepth: -1
corePoolSize: 100
maximumPoolSize: 100