The robots.txt file is simply placed at the root level of a domain, where it can be viewed by a web browser. For example, to see the robots.txt file for Wikipedia, visit the following URL:

http://en.wikipedia.org/robots.txt

The Wikipedia robots.txt file is fairly long. The format itself is actually quite simple. The following section will describe it.
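Because robots.txt always lives at the root of the domain, the file's URL can be built mechanically from any page URL on the same site. A minimal sketch using java.net.URL (the class and method names here are our own, not part of the book's spider):

```java
import java.net.URL;

public class RobotsUrl {
    // Build the root-level robots.txt URL for any page on a site,
    // keeping the page's protocol, host, and port.
    public static URL getRobotsUrl(URL page) throws Exception {
        return new URL(page.getProtocol(), page.getHost(),
                page.getPort(), "/robots.txt");
    }

    public static void main(String[] args) throws Exception {
        URL page = new URL("http://en.wikipedia.org/wiki/Main_Page");
        // Prints: http://en.wikipedia.org/robots.txt
        System.out.println(getRobotsUrl(page));
    }
}
```

Note that getPort() returns -1 when no explicit port is present, which the URL constructor accepts as "use the protocol's default," so the rebuilt URL omits the port just as the original did.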
Because the robots.txt file is publicly accessible, it is not an effective way to hide “private” parts of your web site. Any user with a browser can quickly examine your robots.txt file. To make a part of your web site truly secure, you must use more advanced methods than robots.txt. It is generally best to list a private section in robots.txt and to also assign a password to this part of your web site.
Understanding the robots.txt Format
The two lines that you will most often see in a robots.txt file are User-agent: and Disallow:. The Disallow prefixed lines specify which URLs should not be accessed. The User-agent prefixed lines tell you which program the Disallow lines refer to.
As discussed in Chapter 13, you should create a User-agent name to identify your spider. This will allow a site to exclude your spider, if they so desire. By default, the Heaton Research Spider uses Java's own User-agent string. Because many programs use this User-agent, you should choose another. If a site were to exclude the Java User-agent, your spider would be excluded as well.
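With java.net.HttpURLConnection, the default User-agent header can be replaced before the request is sent. The identifier "MySpider/1.0" below is only a placeholder; substitute your own spider's name:

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SpiderAgent {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://en.wikipedia.org/robots.txt");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // Replace Java's default User-agent with one that names your spider.
        // "MySpider/1.0" is a placeholder identifier, not a real product.
        conn.setRequestProperty("User-Agent", "MySpider/1.0");
        // The header is staged locally; no request has been sent yet.
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```

Setting the header this way costs nothing at connection-setup time; the value is simply attached to the request when the connection is actually opened.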
You may also see User-agent prefixed lines that specify a user agent of “*”. This means all bots. Any instructions following a User-agent of “*” should be observed by all bots, including yours.
The robots.txt Disallow patterns are matched by simple substring comparisons, so care should be taken to make sure that patterns matching directories have the final '/' character appended. If not, all files with names starting with that substring will match, rather than just those in the intended directory.
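A minimal sketch of this matching rule (the class and method names are ours, not part of any library) shows why the trailing '/' matters:

```java
public class DisallowMatch {
    // A Disallow pattern blocks any URL path that begins with it.
    // An empty pattern blocks nothing.
    public static boolean isDisallowed(String path, String pattern) {
        return !pattern.isEmpty() && path.startsWith(pattern);
    }

    public static void main(String[] args) {
        // Without the trailing '/', the pattern also catches files whose
        // names merely start with the same characters.
        System.out.println(isDisallowed("/private-files.html", "/private"));  // true
        System.out.println(isDisallowed("/private-files.html", "/private/")); // false
        System.out.println(isDisallowed("/private/a.html", "/private/"));     // true
    }
}
```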
The following robots.txt file allows all robots to visit all files, because the wild card “*” specifies all robots:
User-agent: *
Disallow:
The following robots.txt file is the opposite of the preceding one; it blocks all bot access: