The robots.txt file is simply placed at the root level of a domain where it can be
viewed by a web browser. For example, to see the robots.txt file for Wikipedia visit the
following URL:
http://en.wikipedia.org/wiki/Robots.txt
The Wikipedia robots.txt file is fairly long, but the format itself is quite simple. The following section describes it.
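Because the file always lives at the root of the host, the robots.txt URL for any page can be derived mechanically. The following minimal sketch (the class name `RobotsTxtLocator` is illustrative, not from the original text) shows one way to do this with `java.net.URL`:

```java
import java.net.MalformedURLException;
import java.net.URL;

public class RobotsTxtLocator {

    // Build the robots.txt URL for the site that hosts the given page.
    // The file always sits at the root of the host, regardless of the
    // path of the page being checked.
    public static URL getRobotsTxtUrl(URL page) throws MalformedURLException {
        return new URL(page.getProtocol(), page.getHost(), page.getPort(),
                "/robots.txt");
    }
}
```

For example, passing the Wikipedia article URL above yields http://en.wikipedia.org/robots.txt.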
Because the robots.txt file is publicly accessible, it is not an effective way to
hide "private" parts of your web site. Any user with a browser can quickly examine your
robots.txt file. To make a part of your web site truly secure, you must use more advanced
methods than robots.txt. It is generally best to list a private section in robots.txt
and to also assign a password to that part of your web site.
Understanding the robots.txt Format
The two lines that you will most often see in a robots.txt file are User-agent:
and Disallow:. The Disallow-prefixed lines specify which URLs should not be accessed. The User-agent-prefixed lines tell you which program the Disallow lines
refer to.
As discussed in Chapter 13, you should create a User-agent name to identify
your spider. This will allow a site to exclude your spider, if it so desires. By default the
Heaton Research Spider uses Java's own User-agent string. Because many programs
use this User-agent, you should choose another. If a site were to exclude the Java
User-agent, your spider would be excluded as well.
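One way to replace Java's default User-agent is to set the header explicitly on each connection. This is a minimal sketch using the standard `java.net.URLConnection` API; the agent name "ExampleSpider/1.0" is a placeholder, not a name from the original text:

```java
import java.net.URL;
import java.net.URLConnection;

public class SpiderConnection {

    // Open a connection that identifies itself with a custom User-agent,
    // so sites can recognize (and, if they wish, exclude) this spider.
    public static URLConnection openWithAgent(String address, String agent)
            throws Exception {
        URL url = new URL(address);
        URLConnection conn = url.openConnection();
        // Override Java's default User-agent header before any data is sent.
        conn.setRequestProperty("User-Agent", agent);
        return conn;
    }
}
```

The header must be set before the connection is actually made; once data has been exchanged, request properties can no longer be changed.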
You may also see User-agent prefixed lines that specify a user agent of “*”. This
means all bots. Any instructions following a User-agent of “*” should be observed by
all bots, including yours.
The robots.txt Disallow patterns are matched by simple substring compari-
sons, so care should be taken to make sure that patterns matching directories have the final
'/' character appended. If not, all files with names starting with that substring will match,
rather than just those in the directory intended.
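The substring matching described above can be sketched in a few lines of Java. This is a simplified parser (class and method names are illustrative): it collects the Disallow patterns that apply to a given agent or to "*", and checks candidate paths with a prefix comparison. Note that real robots.txt files can group several consecutive User-agent lines over one block of rules; this sketch ignores that subtlety.

```java
import java.util.ArrayList;
import java.util.List;

public class RobotsMatcher {

    private final List<String> disallowed = new ArrayList<String>();

    // Collect Disallow patterns that apply to the named agent or to "*".
    public void parse(List<String> lines, String agent) {
        boolean applies = false;
        for (String raw : lines) {
            String line = raw.trim();
            if (line.toLowerCase().startsWith("user-agent:")) {
                String name = line.substring("user-agent:".length()).trim();
                applies = name.equals("*") || name.equalsIgnoreCase(agent);
            } else if (applies && line.toLowerCase().startsWith("disallow:")) {
                String pattern = line.substring("disallow:".length()).trim();
                // An empty Disallow line disallows nothing.
                if (pattern.length() > 0) {
                    disallowed.add(pattern);
                }
            }
        }
    }

    // Patterns are simple prefix matches against the URL path, which is
    // why a directory pattern should end with '/'.
    public boolean isAllowed(String path) {
        for (String pattern : disallowed) {
            if (path.startsWith(pattern)) {
                return false;
            }
        }
        return true;
    }
}
```

With the pattern /private (no trailing slash), a path such as /privatestuff.html is blocked along with everything under /private/; with /private/, only the directory itself is blocked.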
The following robots.txt file allows all robots to visit all files, because the wildcard
"*" specifies all robots and the Disallow line is empty:
User-agent: *
Disallow:
The following robots.txt is the opposite of the preceding one: it blocks all bot access:
User-agent: *
Disallow: /