# A trap entry in robots.txt: no legitimate crawler should ever request this path
Disallow: /hidden
Then check your server logs to see which IPs have actually loaded that file, and what other files those IPs have requested. If it's just a few files, widely separated in time, I'd ignore it. (A short log-scanning sketch appears after the configuration example below.) But if I see that an IP address has been hitting every other page on my site, I'll block it in my Apache configuration file (httpd.conf), like so:
<Directory "/www/xom">
    # Apache 2.2-style access control: allow everyone by default,
    # then deny the specific spider IPs.
    Order allow,deny
    Allow from all
    Deny from 212.0.138.30
    Deny from 83.149.74.179
    Deny from 66.186.173.166
</Directory>
This prevents it from hitting any page on my site, not just the protected ones. However, chances are that spider
is up to no good, so I don't mind doing this.
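To find candidates for that deny list, here is a minimal log-scanning sketch. It assumes the standard combined log format, in which the client IP is the first field and the requested path sits inside the quoted request line; the log path is a placeholder you should adjust for your server.

from collections import Counter

TRAP = "/hidden"                        # the disallowed trap path from robots.txt
LOG = "/var/log/apache2/access_log"     # assumed location; adjust for your server

hits = Counter()
with open(LOG) as log:
    for line in log:
        fields = line.split()
        # Combined log format: IP - - [date tz] "GET /path HTTP/1.1" ...
        # so fields[0] is the client IP and fields[6] is the requested path.
        if len(fields) > 6 and fields[6].startswith(TRAP):
            hits[fields[0]] += 1

for ip, count in hits.most_common():
    print(f"{count:6d}  {ip}")

Any IP this prints has fetched a file that robots.txt explicitly told it not to, which is exactly the behavior described above.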
You can also use mod_rewrite to block robots by User-agent. However, the User-agent string is so easy to change that I rarely bother with this: a spider that's already ignoring robots.txt is hardly going to hesitate to fake its User-agent string to look exactly like a perfectly legitimate copy of Firefox or Internet Explorer.
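For completeness, such a rule takes only a few lines of configuration. This is a minimal sketch in which "BadBot" stands in for whatever User-agent substring you want to block:

RewriteEngine On
# Match the offending User-agent substring, case-insensitively.
RewriteCond %{HTTP_USER_AGENT} BadBot [NC]
# Return 403 Forbidden for every request from that agent.
RewriteRule ^ - [F]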
Some people have automated this procedure or used other means of detection. In particular, a client hitting your site more than 12 times per minute may well be up to no good. However, detecting that requires quite a bit more server intelligence than simple IP blocking.
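As a rough sketch of that kind of detection, the following script buckets requests by IP and by minute and reports any client that exceeds the 12-requests-per-minute threshold mentioned above. It again assumes the combined log format and a hypothetical log path, and its fixed one-minute buckets are a simplification of a true sliding window.

from collections import Counter

LOG = "/var/log/apache2/access_log"   # assumed location; adjust for your server
THRESHOLD = 12                        # requests per minute

counts = Counter()
with open(LOG) as log:
    for line in log:
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[3] looks like "[12/Mar/2008:10:31:07"; dropping the
        # seconds yields a per-minute bucket key.
        ip, stamp = fields[0], fields[3]
        minute = stamp[:stamp.rfind(":")]
        counts[(ip, minute)] += 1

suspects = {ip for (ip, minute), n in counts.items() if n > THRESHOLD}
for ip in sorted(suspects):
    print(ip)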