USING A SPIDER - HTTP Programming Recipes for Java Bots

g.fillRect(10, y, (int) (progressWidth * this.donePercent), 16);

Finally, a black border is drawn around the full width of the progress bar. The border lets the user see the remaining white region, which represents the work that is still to be done.

g.setColor(Color.BLACK);
g.drawRect(10, y, progressWidth, 16);

The bar will be updated until it reaches 100%.
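
The drawing steps described above can be collected into one routine. The following is a minimal sketch rather than the chapter's listing: the class name, the paintBar and render helpers, the green fill color, and the clamping of donePercent to the range 0.0 to 1.0 are all assumptions made for illustration. It paints into an off-screen image so that it can run without opening a window.

```java
import java.awt.Color;
import java.awt.Graphics;
import java.awt.image.BufferedImage;

class ProgressBarSketch {
    static final int BAR_X = 10;      // left edge, as in the fragments above
    static final int BAR_HEIGHT = 16; // bar height, as in the fragments above

    /** Paints a white track, a filled "done" region, and a black border. */
    static void paintBar(Graphics g, int y, int progressWidth, double donePercent) {
        // Assumption: donePercent is a fraction between 0.0 and 1.0.
        double clamped = Math.max(0.0, Math.min(1.0, donePercent));

        g.setColor(Color.WHITE);
        g.fillRect(BAR_X, y, progressWidth, BAR_HEIGHT);

        g.setColor(Color.GREEN); // assumed fill color
        g.fillRect(BAR_X, y, (int) (progressWidth * clamped), BAR_HEIGHT);

        g.setColor(Color.BLACK);
        g.drawRect(BAR_X, y, progressWidth, BAR_HEIGHT);
    }

    /** Renders the bar into an off-screen image (hypothetical helper). */
    static BufferedImage render(int progressWidth, double donePercent) {
        BufferedImage img =
            new BufferedImage(progressWidth + 20, 40, BufferedImage.TYPE_INT_RGB);
        Graphics g = img.getGraphics();
        paintBar(g, 10, progressWidth, donePercent);
        g.dispose();
        return img;
    }

    public static void main(String[] args) {
        BufferedImage img = render(200, 0.5);
        System.out.println("Rendered " + img.getWidth() + "x" + img.getHeight());
    }
}
```

In a Swing component, the same paintBar call could be made from paintComponent, with repaint() requested each time the completed fraction changes.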

Summary

A spider is a special kind of bot that scans HTML pages and looks for more pages to visit. In theory, a spider would continue finding URLs forever, or until it had visited every URL on the Internet. In practice, two factors limit a spider. First, a spider is often given a maximum depth to visit: if a page is deeper than this depth, relative to the home page, the spider will not visit it. Second, spiders are often instructed to stay within a specified set of hosts, and this set is often just one host.
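
The two limits can be illustrated with a small breadth-first walk. This is a sketch under stated assumptions, not the Heaton Research Spider's implementation: the class and method names are hypothetical, the home page is treated as depth 0, and the "web" is an in-memory map from each URL to the links found on it, so that no network access is needed.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class CrawlLimitsSketch {

    /** Returns the URLs visited, in breadth-first order. */
    static List<String> crawl(String home, Map<String, List<String>> links,
                              int maxDepth, Set<String> allowedHosts) {
        List<String> visited = new ArrayList<>();
        Map<String, Integer> depthOf = new HashMap<>(); // doubles as the "seen" set
        Deque<String> queue = new ArrayDeque<>();
        depthOf.put(home, 0); // assumption: the home page is depth 0
        queue.add(home);

        while (!queue.isEmpty()) {
            String url = queue.remove();
            int depth = depthOf.get(url);
            visited.add(url); // a real spider would download and parse the page here

            for (String found : links.getOrDefault(url, Collections.emptyList())) {
                boolean tooDeep = depth + 1 > maxDepth;
                boolean offHost = !allowedHosts.contains(URI.create(found).getHost());
                if (tooDeep || offHost || depthOf.containsKey(found)) {
                    continue; // skip pages that break a limit or were already queued
                }
                depthOf.put(found, depth + 1);
                queue.add(found);
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = new HashMap<>();
        links.put("http://a.example/",
                  Arrays.asList("http://a.example/p1", "http://b.example/x"));
        links.put("http://a.example/p1", Arrays.asList("http://a.example/p2"));
        Set<String> hosts = new HashSet<>(Arrays.asList("a.example"));
        System.out.println(crawl("http://a.example/", links, 1, hosts));
    }
}
```

With a maximum depth of 1 and a single allowed host, the walk stops after the home page and its same-host links; the link to the second host and the page two levels down are never queued.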

This chapter showed how to use the Heaton Research Spider, an open source spider written in Java and C# that is available for free from Heaton Research, Inc. To use the Heaton Research Spider, you must create two objects.

First, a SpiderOptions object must be created to provide the spider with some basic configuration options. The SpiderOptions properties can either be set directly or loaded from a file.
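
The idea of options that can be set directly or loaded from a file can be sketched with the standard java.util.Properties class. This is not the SpiderOptions class itself: the class name, the three option names, their defaults, and the key=value file format are all assumptions chosen for illustration.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class CrawlSettingsSketch {
    // Hypothetical options; they may be assigned directly or filled in by load().
    int maxDepth = -1;          // -1 is used here to mean "no depth limit"
    int timeoutMillis = 60000;
    String userAgent = null;    // null is used here to mean "client default"

    /** Reads key=value pairs; keys that are missing keep their defaults. */
    static CrawlSettingsSketch load(Reader source) throws IOException {
        Properties props = new Properties();
        props.load(source);

        CrawlSettingsSketch s = new CrawlSettingsSketch();
        s.maxDepth = intValue(props, "maxDepth", s.maxDepth);
        s.timeoutMillis = intValue(props, "timeoutMillis", s.timeoutMillis);
        s.userAgent = props.getProperty("userAgent", s.userAgent);
        return s;
    }

    /** Convenience wrapper for text that is already in memory. */
    static CrawlSettingsSketch loadFromString(String text) {
        try {
            return load(new StringReader(text));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static int intValue(Properties props, String key, int fallback) {
        String raw = props.getProperty(key);
        return raw == null ? fallback : Integer.parseInt(raw.trim());
    }

    public static void main(String[] args) {
        CrawlSettingsSketch s = loadFromString("maxDepth=3\nuserAgent=TestBot/1.0\n");
        System.out.println(s.maxDepth + " " + s.timeoutMillis + " " + s.userAgent);
    }
}
```

To read from disk, a java.io.FileReader can be passed to load in place of the StringReader. Setting the fields directly on a new instance gives the same result without a file.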

Second, a WorkloadManager is required. For simple spiders, you may choose the MemoryWorkloadManager, which stores all URLs in the computer's memory. For larger spiders, you should use the SQLWorkloadManager, which stores the URL workload in an SQL database.
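
The split between a memory-backed and a database-backed workload suggests a common contract with interchangeable storage behind it. The sketch below shows only the in-memory side, and it is an illustration under assumptions: the UrlWorkload interface and its three methods are hypothetical and are not the Heaton Research WorkloadManager interface.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical contract for a URL workload (not the library's interface). */
interface UrlWorkload {
    boolean add(String url); // returns false if the URL was already known
    String next();           // returns null when nothing is waiting
    int waitingCount();
}

/** Keeps every URL in memory, so the workload is lost when the program exits. */
class InMemoryUrlWorkload implements UrlWorkload {
    private final Set<String> known = new HashSet<>();
    private final Deque<String> waiting = new ArrayDeque<>();

    @Override
    public synchronized boolean add(String url) {
        if (!known.add(url)) {
            return false; // already queued or already processed
        }
        waiting.add(url);
        return true;
    }

    @Override
    public synchronized String next() {
        return waiting.poll(); // first in, first out
    }

    @Override
    public synchronized int waitingCount() {
        return waiting.size();
    }
}

class WorkloadDemo {
    public static void main(String[] args) {
        UrlWorkload work = new InMemoryUrlWorkload();
        work.add("http://a.example/");
        work.add("http://a.example/"); // duplicate, ignored
        System.out.println(work.waitingCount() + " URL(s) waiting");
    }
}
```

A database-backed variant would implement the same interface with a table of URLs and a status column, which allows a workload larger than memory and lets an interrupted crawl resume.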

This chapter provided four recipes. The first showed how to use a spider to check for bad links on a web site. The second showed how to use a spider to download a site. The third showed how to create a spider that accesses a large number of URLs and does not restrict itself to a single host. The fourth showed how to display statistics from the database as a spider executes.

Now that you know how to use the Heaton Research Spider, the next chapter takes you through its internals. If you are content with only using the Heaton Research Spider and do not yet wish to learn how a spider is built, you may safely skip to Chapter 16 and learn how to create well-behaved bots; otherwise, continue through Chapters 14 and 15 to learn the internals of the Heaton Research Spider.
