CHAPTER 14: INSIDE THE HEATON RESEARCH SPIDER
• The Spider Class
• How Workloads are Managed
• Reading Configuration Files
• Thread Pools
• The Memory-Based Workload Manager
• Spider HTML Parsing
• Spider Streams
Chapter 13 taught you how to use the Heaton Research Spider. The Heaton Research
Spider is an advanced and very extensible spider that can be applied to both small and large
spider tasks. This chapter goes beyond showing you how to use the spider; it will show you
how the Heaton Research Spider is constructed. Because the Heaton Research Spider is
open source, you are free to download the source code and make your own modifications.
The Heaton Research Spider is an ongoing open source project. Because of this, there
may have been enhancements made to the spider after the publication of this topic. You can
always check the Heaton Research Spider's home page for the latest updates. The latest version
of the Heaton Research Spider can always be found at:
http://www.heatonresearch.com/projects/spider/
If you are content simply using the Heaton Research Spider and are not currently interested
in how it works internally, you can safely skip to Chapter 16, “Well Behaved Bots”.
However, you may still wish to visit the above URL to obtain the latest version. Additionally,
there is a forum at the above URL where you can discuss using and modifying the Heaton
Research Spider.
The Heaton Research Spider is made up of several different classes. These classes are
summarized in Table 14.1.
 