Block Robots
Add a robots.txt file at the root of your site that lists the directories you wish to block. Install a honeypot to detect and block impolite robots and venomous spiders.
User-agent: *
Disallow: /scripts/
Disallow: /styles/
Disallow: /management/
Disallow: /support/
Disallow: /DTD/
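A honeypot works by disallowing a path in robots.txt and linking to it invisibly from your pages: no human ever follows the link, so any client that requests it is a robot ignoring the rules. Here is a minimal sketch, assuming a hypothetical trap path /management/trap.html under one of the disallowed directories above; a real deployment would block at the web server or firewall rather than in an application, but the logic is the same:

from http.server import BaseHTTPRequestHandler, HTTPServer

BANNED = set()  # IPs that have requested the trap path

class Honeypot(BaseHTTPRequestHandler):
    def do_GET(self):
        ip = self.client_address[0]
        if self.path.startswith("/management/trap"):
            # Only a robot ignoring robots.txt ever reaches this path.
            BANNED.add(ip)
        if ip in BANNED:
            self.send_error(403)
            return
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<p>Hello, polite visitor.</p>")

if __name__ == "__main__":
    HTTPServer(("", 8000), Honeypot).serve_forever()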
Motivation
Robots waste your bandwidth, and they can discover and expose private pages.
Potential Trade-offs
Blocking robots prevents search engines from indexing your site, which keeps people from finding your pages. Be
careful to block only the subtrees you really don't want to be public.
Mechanics
First identify and catalog those parts of your URL hierarchy that should be invisible to search engines and other
spiders. Remember to look at this from the perspective of an outside user looking in. This is based on the
apparent URL hierarchy, not the actual layout of files on a disk. For example, here's a typical list of URLs you
might want to exclude:
/cgi-bin
/store/checkout
/personaldata/
/experimental
/staging
Of course, the details will vary depending on your site layout. I suggest blocking even those directories a robot
"can't" get into. Defense in depth is a good thing.
Place each of these paths in a single file called robots.txt at the root of your site, each preceded by Disallow:, like so:
User-agent: *
Disallow: /cgi-bin
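Once the file is assembled, it is worth checking that it blocks exactly what you intend. Python's standard urllib.robotparser can parse the file and answer allow/deny questions; here is a quick sanity check using the complete file built from the example paths above (the test URLs are hypothetical):

from urllib.robotparser import RobotFileParser

# The complete file assembled from the example paths above.
rules = """\
User-agent: *
Disallow: /cgi-bin
Disallow: /store/checkout
Disallow: /personaldata/
Disallow: /experimental
Disallow: /staging
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

for url in ("/cgi-bin/search.pl", "/store/checkout", "/index.html"):
    verdict = "allowed" if rp.can_fetch("*", url) else "blocked"
    print(url, verdict)

The first two URLs should print as blocked and the last as allowed; any other result means a Disallow line is missing or mistyped.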