CAPTCHAs are designed to make sure that a human is using the system. The obvious circumvention is to have a human identify hundreds of CAPTCHAs an hour in a highly automated fashion. Low-wage human workers could perform this task.
Some poorly designed CAPTCHA protection systems can be bypassed, without using OCR, simply by reusing the session ID of a known CAPTCHA image. Sometimes part of the software generating the CAPTCHA runs client-side, and the CAPTCHA text can easily be lifted from the HTML page. As web sites become more sophisticated in defending against bot spamming, client-side CAPTCHAs are becoming very rare.
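To see why client-side CAPTCHAs are weak, consider a hypothetical page that embeds the expected answer in a hidden form field (the page and field name below are invented for illustration). A minimal Java sketch can lift the text straight out of the page source, with no OCR at all:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClientSideCaptchaExample {
    public static void main(String[] args) {
        // HTML as it might be returned by a poorly designed, client-side CAPTCHA.
        // The "answer" sits in plain text inside a hidden form field.
        String html = "<form action=\"post.php\" method=\"post\">"
                + "<img src=\"captcha.png\"/>"
                + "<input type=\"hidden\" name=\"captcha_answer\" value=\"X7R2P\"/>"
                + "</form>";

        // Pull the value straight out of the page source.
        Pattern p = Pattern.compile("name=\"captcha_answer\"\\s+value=\"([^\"]+)\"");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println("CAPTCHA text found in HTML: " + m.group(1));
        }
    }
}

Because the answer never has to be read from the image, the image itself provides no protection in this design.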
Overall, CAPTCHAs are a fairly effective defense against bots. Using a CAPTCHA on a site will force the spammer, or other malicious bot programmer, to go to great lengths to access the site.
User Agent Filtering
Some web server software allows you to exclude certain clients based on their User-Agent. If you are using the Apache web server, this can be configured using an .htaccess file. For example, to exclude the User-Agents BadBot and AnotherBadBot, use the following .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^AnotherBadBot
# Deny matching User-Agents with a 403 Forbidden response
RewriteRule ^.* - [F,L]
If either of these bots tries to access a URL on the server, it will receive the following error:
403 Forbidden
Simply changing your bot's User-Agent name can circumvent this. Of course, doing so is unethical!
User-Agent filtering is generally very effective against commercial spiders that do not allow their User-Agent to be changed.
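For reference, the User-Agent is just a request header supplied by the client. The following Java sketch (the bot name and URL are hypothetical) shows how a bot identifies itself when making a request:

import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; replace with the page your bot requests.
        URL url = new URL("http://www.example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The User-Agent header is entirely under the client's control,
        // which is why filtering on it only stops well-behaved bots.
        conn.setRequestProperty("User-Agent", "ExampleBot/1.0");
        System.out.println("Response code: " + conn.getResponseCode());
    }
}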
Robots Exclusion Standard
The robots exclusion standard, or robots.txt protocol, allows a web site to specify which portions of the site may be accessed by bots. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the web site.
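As a hypothetical example, a robots.txt file asking all bots to stay out of two directories (the directory names are invented) might look like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

Compliance is voluntary; well-behaved bots check this file before requesting pages, but nothing forces a malicious bot to honor it.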
The robots.txt protocol was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). There is no official standards body or RFC document for this protocol. RFC, or Request for Comments, is a document that describes an official Internet standard.