The robots.txt file is simply placed at the root level of a domain, where it can be viewed by a web browser. For example, to see the robots.txt file for Wikipedia, visit the following URL:

http://en.wikipedia.org/robots.txt

The Wikipedia robots.txt file is fairly long. The format itself is actually quite simple. The following section will describe it.
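Because robots.txt always lives at the root of the domain, the file's URL can be built mechanically from any page URL on the same site. A minimal sketch using java.net.URL (the class and method names here are our own, not part of the book's spider):

```java
import java.net.URL;

public class RobotsUrl {
    // Build the root-level robots.txt URL for any page on a site,
    // keeping the page's protocol, host, and port.
    public static URL getRobotsUrl(URL page) throws Exception {
        return new URL(page.getProtocol(), page.getHost(),
                page.getPort(), "/robots.txt");
    }

    public static void main(String[] args) throws Exception {
        URL page = new URL("http://en.wikipedia.org/wiki/Main_Page");
        // Prints: http://en.wikipedia.org/robots.txt
        System.out.println(getRobotsUrl(page));
    }
}
```

Note that getPort() returns -1 when no explicit port is present, which the URL constructor accepts as "use the protocol's default," so the rebuilt URL omits the port just as the original did.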
Because the robots.txt file is publicly accessible, it is not an effective way to hide “private” parts of your web site. Any user with a browser can quickly examine your robots.txt file. To make a part of your web site truly secure, you must use more advanced methods than robots.txt. It is generally best to list a private section in robots.txt and to also assign a password to this part of your web site.
Understanding the robots.txt Format
The two lines that you will most often see in a robots.txt file are User-agent: and Disallow:. The Disallow prefixed lines specify which URLs should not be accessed. The User-agent prefixed lines tell you which program the Disallow lines refer to.
As discussed in Chapter 13, you should create a User-agent name to identify your spider. This will allow a site to exclude your spider, if they so desire. By default, the Heaton Research Spider uses Java's own User-agent string. Because many programs use this User-agent, you should choose another. If a site were to exclude the Java User-agent, your spider would be excluded as well.
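With java.net.HttpURLConnection, the default User-agent header can be replaced before the request is sent. The identifier "MySpider/1.0" below is only a placeholder; substitute your own spider's name:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SpiderAgent {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://en.wikipedia.org/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Replace Java's default User-agent with one that names your spider.
        // "MySpider/1.0" is a placeholder identifier, not a real product.
        conn.setRequestProperty("User-Agent", "MySpider/1.0");
        // The header is staged locally; no request has been sent yet.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Setting the header this way costs nothing at connection-setup time; the value is simply attached to the request when the connection is actually opened.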
You may also see User-agent prefixed lines that specify a user agent of “*”. This means all bots. Any instructions following a User-agent of “*” should be observed by all bots, including yours.
The robots.txt Disallow patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended. If not, all files with names starting with that substring will match, rather than just those in the intended directory.
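A minimal sketch of this matching rule (the class and method names are ours, not part of any library) shows why the trailing '/' matters:

```java
public class DisallowMatch {
    // A Disallow pattern blocks any URL path that begins with it.
    // An empty pattern blocks nothing.
    public static boolean isDisallowed(String path, String pattern) {
        return !pattern.isEmpty() && path.startsWith(pattern);
    }

    public static void main(String[] args) {
        // Without the trailing '/', the pattern also catches files whose
        // names merely start with the same characters.
        System.out.println(isDisallowed("/private-files.html", "/private"));  // true
        System.out.println(isDisallowed("/private-files.html", "/private/")); // false
        System.out.println(isDisallowed("/private/a.html", "/private/"));     // true
    }
}
```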
The following robots.txt file allows all robots to visit all files, because the wild card “*” specifies all robots:
User-agent: *
Disallow:
The following robots.txt file is the opposite of the preceding one; it blocks all bot access: