CHAPTER 13: USING A SPIDER
• Understanding when to Use a Spider
• Introducing the Heaton Research Spider
• Using Thread Pools
• Using Memory to Track URLs
• Using SQL to Track URLs
• Spidering Many Web Sites
A spider is a special type of bot that is designed to crawl the World Wide Web, just like
a biological spider crawls its web. While bots are usually designed to work with one specific
site, spiders are usually designed to work with a vast number of web sites.
A spider always starts on a single page. This page is scanned for links, and all links to other pages are stored in a list. The spider then visits the next URL on the list and scans for links again. This process continues indefinitely as the spider finds new pages.
Usually, some restriction is placed on the spider to limit which pages it will visit. Without any restriction, the spider would theoretically visit every page on the entire Internet.
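The crawl loop just described can be sketched in a few lines of Java. This is a minimal, single-threaded illustration, not the Heaton Research Spider itself: the link-extraction step is faked with a small in-memory link map standing in for real HTML parsing, and the "restriction" is a simple host check. All class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawlLoop {

    // Hypothetical stand-in for HTML parsing: a tiny in-memory "web site"
    // mapping each page URL to the links found on that page.
    static final Map<String, List<String>> LINKS = Map.of(
        "http://site/a", List.of("http://site/b", "http://site/c"),
        "http://site/b", List.of("http://site/a"),
        "http://site/c", List.of("http://other/x"));

    static List<String> extractLinks(String url) {
        return LINKS.getOrDefault(url, List.of());
    }

    // Start on one page, scan it for links, add them to the list,
    // and keep visiting until no new pages remain.
    public static Set<String> crawl(String startUrl, String allowedHost) {
        Queue<String> toVisit = new ArrayDeque<>(); // URLs waiting to be scanned
        Set<String> visited = new HashSet<>();      // URLs already scanned
        toVisit.add(startUrl);
        while (!toVisit.isEmpty()) {
            String url = toVisit.remove();
            if (!url.contains(allowedHost)) {
                continue; // restriction: skip pages outside the allowed host
            }
            if (!visited.add(url)) {
                continue; // already scanned this page
            }
            for (String link : extractLinks(url)) {
                if (!visited.contains(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }
}
```

Because visited pages are tracked in a set, the loop terminates once every reachable on-host page has been seen, even though the link graph contains a cycle.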
Most of the recipes in this book have been fairly self-contained, usually requiring only a class or two to function. This presents the reader with an efficient solution, with minimal overhead. That strategy does not work as well for a spider. Creating a truly effective spider involves many considerations, such as:
• Thread pooling and synchronization
• Storing a very large URL list
• HTML parsing
• Working with an SQL database
• Reporting results
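To give a flavor of just the first two considerations, the sketch below uses Java's standard `ExecutorService` for thread pooling and a concurrent set for the shared URL list. This is an illustrative fragment under assumed names (`PooledWorkers`, `process`), not how the Heaton Research Spider is actually structured; the real work of downloading and parsing each page is stubbed out.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledWorkers {

    // Hand each URL to a fixed pool of worker threads and return the set
    // of URLs that were handled. A real worker would download and parse
    // the page instead of merely recording the URL.
    public static Set<String> process(List<String> urls)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 spider threads
        Set<String> processed = ConcurrentHashMap.newKeySet();  // thread-safe URL set
        for (String url : urls) {
            pool.submit(() -> processed.add(url)); // stand-in for the real work
        }
        pool.shutdown();                           // accept no new tasks
        pool.awaitTermination(10, TimeUnit.SECONDS); // let the workers drain
        return processed;
    }
}
```

The concurrent set matters: with multiple worker threads, a plain `HashSet` of visited URLs would be corrupted by simultaneous updates.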
These considerations do not lend themselves to a concise example. This chapter will show you how to use the Heaton Research Spider, an ever-evolving open source spider produced by the publisher of this book. The Heaton Research Spider is available in both Java and C#, and can be obtained from the following URL:
http://www.heatonresearch.com/projects/spider/