Summary
Not all sites welcome bots. This is particularly true of bots that post information to web
sites. You should never create a bot that accesses web sites unethically. If a web site does not
wish to be accessed with a bot, you should respect that site's wishes. Web sites use a number
of means to prevent bots from accessing their sites.
CAPTCHAs are a very common method for preventing bot access. CAPTCHA is an acronym for "Completely Automated Public Turing test to tell Computers and Humans Apart". A CAPTCHA displays an image and requires the user to enter the text shown in that image. Because a bot cannot easily read the image, it is unable to continue.
User agent filtering is another common method for denying access to bots. Most web servers allow the web master to enter a list of user agents that identify bots they wish to block. If a bot on the blocked list attempts to access the site, it will receive an HTTP error.
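As an illustration, the following is a minimal Java sketch of a bot that identifies itself with a descriptive User-Agent header and stops when the server refuses that agent. The URL and agent string are placeholders, not taken from any real site.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class UserAgentExample {
    public static void main(String[] args) throws IOException {
        URL url = new URL("http://www.example.com/");
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();

        // Identify the bot honestly instead of impersonating a browser.
        connection.setRequestProperty("User-Agent",
            "ExampleBot/1.0 (+http://www.example.com/bot-info)");

        int status = connection.getResponseCode();
        if (status == HttpURLConnection.HTTP_FORBIDDEN) {
            // A 403 may mean the server filters this user agent; respect it.
            System.out.println("Access denied for this user agent; stopping.");
        } else {
            System.out.println("Server responded with status " + status);
        }
        connection.disconnect();
    }
}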
The Bot Exclusion Standard is another method commonly used to restrict bots. Web site owners can place a file, named robots.txt, in the root of their web server that defines which parts of the site are disallowed for certain bots. Every bot should consult this file to learn which parts of the web site to stay away from.
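The Java sketch below shows one simplified way a bot might honor the Bot Exclusion Standard: it downloads robots.txt and checks a path against the Disallow rules in the "User-agent: *" section. A production bot should also honor rules aimed at its own user agent name; the host and path used here are placeholders.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;

public class RobotsTxtCheck {

    // Returns false if the path falls under a Disallow rule for all bots.
    public static boolean isAllowed(String host, String path) throws IOException {
        List<String> disallowed = new ArrayList<String>();
        boolean appliesToAll = false;

        URL robots = new URL("http://" + host + "/robots.txt");
        BufferedReader in = new BufferedReader(
                new InputStreamReader(robots.openStream()));
        try {
            String line;
            while ((line = in.readLine()) != null) {
                line = line.trim();
                if (line.toLowerCase().startsWith("user-agent:")) {
                    // Track whether the rules that follow apply to every bot.
                    String agent = line.substring(11).trim();
                    appliesToAll = agent.equals("*");
                } else if (appliesToAll && line.toLowerCase().startsWith("disallow:")) {
                    String rule = line.substring(9).trim();
                    if (rule.length() > 0) {
                        disallowed.add(rule);
                    }
                }
            }
        } finally {
            in.close();
        }

        // The path is off-limits if it falls under any Disallow prefix.
        for (String rule : disallowed) {
            if (path.startsWith(rule)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) throws IOException {
        System.out.println(isAllowed("www.example.com", "/private/page.html"));
    }
}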
You now know how to create a variety of bots that can access web sites in many different
ways. This chapter ends the topic by showing you how to use bots ethically. How you use bots
is ultimately up to you. It is our hope that you use bots to make the Internet a better place for
all!
Remember, with great power comes great responsibility.
-- Uncle Ben, Spider-Man, Sony Pictures, 2002