CHAPTER 13: USING A SPIDER
• Understanding when to Use a Spider
• Introducing the Heaton Research Spider
• Using Thread Pools
• Using Memory to Track URLs
• Using SQL to Track URLs
• Spidering Many Web Sites
A spider is a special type of bot that is designed to crawl the World Wide Web, just like
a biological spider crawls its web. While bots are usually designed to work with one specific
site, spiders are usually designed to work with a vast number of web sites.
A spider always starts on a single page. This page is scanned for links, and all links to other pages are stored in a list. The spider then visits the next URL on the list and scans for links again. This process continues indefinitely as the spider finds new pages.
Usually, some restriction is placed on the spider to limit which pages it will visit. Without any restriction, the spider would theoretically visit every page on the entire Internet.
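The crawl loop just described can be sketched in a few lines of Java. This is a minimal, single-threaded illustration, not the Heaton Research Spider itself: the link-extraction step is faked with a small in-memory link map standing in for real HTML parsing, and the "restriction" is a simple host check. All class and method names here are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;

public class SimpleCrawlLoop {

    // Hypothetical stand-in for HTML parsing: a tiny in-memory "web site"
    // mapping each page URL to the links found on that page.
    static final Map<String, List<String>> LINKS = Map.of(
        "http://site/a", List.of("http://site/b", "http://site/c"),
        "http://site/b", List.of("http://site/a"),
        "http://site/c", List.of("http://other/x"));

    static List<String> extractLinks(String url) {
        return LINKS.getOrDefault(url, List.of());
    }

    // Start on one page, scan it for links, add them to the list,
    // and keep visiting until no new pages remain.
    public static Set<String> crawl(String startUrl, String allowedHost) {
        Queue<String> toVisit = new ArrayDeque<>(); // URLs waiting to be scanned
        Set<String> visited = new HashSet<>();      // URLs already scanned
        toVisit.add(startUrl);
        while (!toVisit.isEmpty()) {
            String url = toVisit.remove();
            if (!url.contains(allowedHost)) {
                continue; // restriction: skip pages outside the allowed host
            }
            if (!visited.add(url)) {
                continue; // already scanned this page
            }
            for (String link : extractLinks(url)) {
                if (!visited.contains(link)) {
                    toVisit.add(link);
                }
            }
        }
        return visited;
    }
}
```

Because visited pages are tracked in a set, the loop terminates once every reachable on-host page has been seen, even though the link graph contains a cycle.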
Most of the recipes in this book have been fairly self-contained, usually requiring only a class or two to function. This presents the reader with an efficient solution, with minimal overhead. That strategy does not work as well for a spider. Creating a truly effective spider involves many considerations, such as:
• Thread pooling and synchronization
• Storing a very large URL list
• HTML parsing
• Working with an SQL database
• Reporting results
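To give a flavor of just the first two considerations, the sketch below uses Java's standard `ExecutorService` for thread pooling and a concurrent set for the shared URL list. This is an illustrative fragment under assumed names (`PooledWorkers`, `process`), not how the Heaton Research Spider is actually structured; the real work of downloading and parsing each page is stubbed out.

```java
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class PooledWorkers {

    // Hand each URL to a fixed pool of worker threads and return the set
    // of URLs that were handled. A real worker would download and parse
    // the page instead of merely recording the URL.
    public static Set<String> process(List<String> urls)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(4); // 4 spider threads
        Set<String> processed = ConcurrentHashMap.newKeySet();  // thread-safe URL set
        for (String url : urls) {
            pool.submit(() -> processed.add(url)); // stand-in for the real work
        }
        pool.shutdown();                           // accept no new tasks
        pool.awaitTermination(10, TimeUnit.SECONDS); // let the workers drain
        return processed;
    }
}
```

The concurrent set matters: with multiple worker threads, a plain `HashSet` of visited URLs would be corrupted by simultaneous updates.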
These considerations do not lend themselves to a concise example. This chapter will show you how to use the Heaton Research Spider, an ever-evolving open source spider produced by the publisher of this book. The Heaton Research Spider is available in both Java and C#, and can be obtained from the following URL:
http://www.heatonresearch.com/projects/spider/