CAPTCHAs are designed to make sure that a human is using the system. The obvious circumvention is to have a human identify hundreds of CAPTCHAs an hour in a highly automated fashion. Low-wage human workers could perform this task.
Some poorly designed CAPTCHA protection systems can be bypassed, without using OCR, simply by reusing the session ID of a known CAPTCHA image. Sometimes part of the software generating the CAPTCHA runs client-side, and the CAPTCHA text can easily be lifted from the HTML page. As web sites become more sophisticated in defending against bot spamming, client-side CAPTCHAs are becoming very rare.
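To see why client-side CAPTCHAs are weak, consider a hypothetical page that embeds the expected answer in a hidden form field (the page and field name below are invented for illustration). A minimal Java sketch can lift the text straight out of the page source, with no OCR at all:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ClientSideCaptchaExample {
    public static void main(String[] args) {
        // HTML as it might be returned by a poorly designed, client-side CAPTCHA.
        // The "answer" sits in plain text inside a hidden form field.
        String html = "<form action=\"post.php\" method=\"post\">"
                + "<img src=\"captcha.png\"/>"
                + "<input type=\"hidden\" name=\"captcha_answer\" value=\"X7R2P\"/>"
                + "</form>";

        // Pull the value straight out of the page source.
        Pattern p = Pattern.compile("name=\"captcha_answer\"\\s+value=\"([^\"]+)\"");
        Matcher m = p.matcher(html);
        if (m.find()) {
            System.out.println("CAPTCHA text found in HTML: " + m.group(1));
        }
    }
}

Because the answer never has to be read from the image, the image itself provides no protection in this design.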
Overall, CAPTCHAs are a fairly effective defense against bots. Using a CAPTCHA on a site will force the spammer, or other malicious bot programmer, to go to great lengths to access the site.
User Agent Filtering
Some web server software allows you to exclude certain clients based on their User-Agent. If you are using the Apache web server, this can be configured using an .htaccess file. For example, to exclude the User-Agents BadBot and AnotherBadBot, use the following .htaccess file:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BadBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^AnotherBadBot
# Deny matching User-Agents with a 403 Forbidden response
RewriteRule ^.* - [F,L]
If either of these bots tries to access a URL on the server, it will receive the following error:
403 Forbidden
Simply changing your bot's User-Agent name can circumvent this. Of course, doing so is unethical!
User-Agent filtering is generally very effective against commercial spiders that do not allow their User-Agent to be changed.
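For reference, the User-Agent is just a request header supplied by the client. The following Java sketch (the bot name and URL are hypothetical) shows how a bot identifies itself when making a request:

import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentExample {
    public static void main(String[] args) throws Exception {
        // Hypothetical URL; replace with the page your bot requests.
        URL url = new URL("http://www.example.com/");
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        // The User-Agent header is entirely under the client's control,
        // which is why filtering on it only stops well-behaved bots.
        conn.setRequestProperty("User-Agent", "ExampleBot/1.0");
        System.out.println("Response code: " + conn.getResponseCode());
    }
}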
Robots Exclusion Standard
The robots exclusion standard, or robots.txt protocol, allows a web site to specify which portions of the site may be accessed by bots. The parts that should not be accessed are listed in a file called robots.txt in the top-level directory of the web site.
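As a hypothetical example, a robots.txt file asking all bots to stay out of two directories (the directory names are invented) might look like this:

User-agent: *
Disallow: /cgi-bin/
Disallow: /private/

Compliance is voluntary; well-behaved bots check this file before requesting pages, but nothing forces a malicious bot to honor it.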
The robots.txt protocol was created by consensus in June 1994 by members of the robots mailing list (robots-request@nexor.co.uk). There is no official standards body or RFC document for this protocol. RFC, or Request for Comments, is a document that describes an official Internet standard.