HTML and CSS Reference
Place this file at the root level of your web server. That is, it should be retrievable from
http://www.example.com/robots.txt. Remember that URL paths are case-sensitive, so ROBOTS.TXT or robots.TXT
and similar variations will not work.
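As a small illustration of the "root level" rule, the sketch below derives the robots.txt location from any page URL on the same site. The helper name robots_url is hypothetical, and the example page URL is made up.

```python
from urllib.parse import urljoin

def robots_url(page_url: str) -> str:
    """Return the robots.txt location for the site that serves page_url."""
    # The leading slash makes urljoin drop the page's own path and query,
    # keeping only the scheme and host.
    return urljoin(page_url, "/robots.txt")

print(robots_url("http://www.example.com/articles/2008/index.html"))
# http://www.example.com/robots.txt
```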
You can also block robots by User-agent. For example, if you have a grudge against Google and want to keep it
out of your pages, the following robots.txt will do that while allowing other search engines in:
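A minimal sketch of such a file, assuming Google's crawler identifies itself as Googlebot. The rules are held in a Python string only so that the standard urllib.robotparser module can confirm they behave as described; the helper name allowed and the agent name ExampleBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents: shut Googlebot out of the whole site.
# Robots that are not named here have no rules and are allowed in.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /
"""

def allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Report whether user_agent may fetch url under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)

url = "http://www.example.com/index.html"
print(allowed(ROBOTS_TXT, "Googlebot", url))   # False
print(allowed(ROBOTS_TXT, "ExampleBot", url))  # True
```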
Note that this works only on robots. This would not allow you to exclude real browsers such as Mozilla and
Internet Explorer, no matter what you put in the User-agent string. The goal here is precisely to block robots
while allowing real people in.
You can even specify different rules for different robots. For example, this blocks all robots from /cgi-bin, blocks
Googlebot from /staging and /experimental, and blocks Turnitin from the entire site:
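One way such a file might look, assuming the crawlers identify themselves as Googlebot and TurnitinBot. The /cgi-bin/ rule is repeated in the Googlebot record because a parser such as Python's urllib.robotparser applies only the record that names the robot and falls back to the * record otherwise. The helper name allowed and the agent name ExampleBot are hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents, one record per group of robots.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/

User-agent: Googlebot
Disallow: /cgi-bin/
Disallow: /staging/
Disallow: /experimental/

User-agent: TurnitinBot
Disallow: /
"""

def allowed(user_agent: str, path: str) -> bool:
    """Report whether user_agent may fetch path under ROBOTS_TXT."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, "http://www.example.com" + path)

print(allowed("ExampleBot", "/cgi-bin/search"))      # False: blocked for everyone
print(allowed("ExampleBot", "/staging/index.html"))  # True: only Googlebot is kept out
print(allowed("Googlebot", "/staging/index.html"))   # False
print(allowed("Googlebot", "/index.html"))           # True
print(allowed("TurnitinBot", "/index.html"))         # False: blocked from the whole site
```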
The original syntax is quite minimal. In particular, it has no "Allow" command, although major crawlers such as
Googlebot later adopted one as an extension and RFC 9309 now standardizes it. Without Allow, you cannot block all
user agents from a directory and then allow one in particular. Similarly, you cannot block a root directory but
allow access to one or more of its files or subdirectories. Robots.txt is a pretty blunt instrument.
You can also specify a robots meta tag in the head of HTML documents. However, few robots recognize or
respect this. You really should use a robots.txt file to prevent robotic visits.
The especially dangerous robots are those that don't follow the rules and spider your site whether you permit
them to or not. To stop these, you have to detect them and then block them by IP address. Detecting them
isn't hard. You just set up a few links in your pages that only robots are likely to find. For example, you can
have a link with no content, like this:
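A sketch of such a trap and of spotting who falls into it. The /hidden/ directory, the trap.html file name, and the helper trap_visitors are all hypothetical, and the log lines are made-up samples assumed to be in Common Log Format. How the collected addresses are then blocked depends on your server and is not shown.

```python
import re

# Hypothetical trap link: an anchor with no text, so a person has nothing to click.
TRAP_LINK = '<a href="/hidden/trap.html"></a>'
TRAP_PREFIX = "/hidden/"

# Captures the client address and the requested path from a Common Log Format line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "[A-Z]+ (\S+)')

def trap_visitors(log_lines):
    """Return the set of client addresses that requested anything under the trap."""
    visitors = set()
    for line in log_lines:
        match = LOG_LINE.match(line)
        if match and match.group(2).startswith(TRAP_PREFIX):
            visitors.add(match.group(1))
    return visitors

sample_log = [
    '192.0.2.10 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.0" 200 2326',
    '198.51.100.7 - - [10/Oct/2008:13:55:40 -0700] "GET /hidden/trap.html HTTP/1.0" 200 120',
]
print(trap_visitors(sample_log))  # {'198.51.100.7'}
```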
Block the hidden directory in robots.txt so that well-behaved robots will ignore it:
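A minimal sketch of that rule, assuming the trap lives in a hypothetical /hidden/ directory. Python's standard urllib.robotparser is used here only to confirm that a robot honoring the file would stay out; the agent name ExampleBot is made up.

```python
from urllib.robotparser import RobotFileParser

# Assumed file contents: every robot is told to stay out of /hidden/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /hidden/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

print(parser.can_fetch("ExampleBot", "http://www.example.com/hidden/trap.html"))  # False
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))        # True
```

Any client that requests the trap anyway has either ignored robots.txt or never read it.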