options.load("spider.conf");
Spider spider = new Spider(options, report);
spider.addURL(base, null, 1);
spider.process();
System.out.println(spider.getStatus());
First, a variable named base is created that contains the base URL at which
the spider begins. Next, a SimpleReport object named report is created. The
SimpleReport class implements a spider ReportableInterface and is provided
by the Heaton Research Spider. However, SimpleReport does nothing more than
allow the spider to continue crawling; no data is processed. It is suitable only
for testing the spider.
Next, a SpiderOptions object named options is created. The options object
loads configuration data from a file named spider.conf. Now that we have both a
configuration object and a report object, we can create a Spider object named spider.
Finally, the base URL is added to the spider object, and the process method is
called. The process method will not return until the spider is finished. If you wish to
cancel the spider processing early, you must call the cancel method on the spider
object.
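Because process blocks until the crawl is finished, a call to cancel usually has to come from somewhere other than the thread that is waiting on process. The following is a minimal sketch of that pattern, not code from the Heaton Research Spider: FakeSpider is a hypothetical stand-in that only simulates a crawl, and runWithTimeout is an illustrative helper name. The only details taken from the text are that process blocks and that cancel stops the spider early.

```java
import java.util.concurrent.atomic.AtomicBoolean;

public class CancelSketch {

    /** Hypothetical stand-in for a spider; NOT the Heaton Research Spider class. */
    public static class FakeSpider {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);

        /** Blocks until the simulated crawl finishes or cancel() is called. */
        public void process() {
            // Simulate a long crawl: up to 100,000 pages at roughly 1 ms each.
            for (int page = 0; page < 100_000 && !cancelled.get(); page++) {
                try {
                    Thread.sleep(1);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
        }

        /** Asks the crawl loop to stop at its next check. */
        public void cancel() {
            cancelled.set(true);
        }

        public boolean wasCancelled() {
            return cancelled.get();
        }
    }

    /**
     * Runs process() on a worker thread and cancels it if it is still running
     * after timeoutMillis (must be greater than zero). Returns true once the
     * worker thread has stopped.
     */
    public static boolean runWithTimeout(FakeSpider spider, long timeoutMillis) {
        Thread worker = new Thread(spider::process);
        worker.start();
        try {
            worker.join(timeoutMillis);   // wait for a normal finish
            if (worker.isAlive()) {
                spider.cancel();          // still crawling: request an early stop
                worker.join();            // wait for process() to return
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            spider.cancel();
        }
        return !worker.isAlive();
    }

    public static void main(String[] args) {
        FakeSpider spider = new FakeSpider();
        boolean stopped = runWithTimeout(spider, 50);
        System.out.println("stopped=" + stopped
                + ", cancelled=" + spider.wasCancelled());
    }
}
```

With the real spider, the same idea would apply by running the blocking call on a worker thread and calling cancel from the controlling thread; whether cancel returns immediately or waits for in-flight pages is not stated here and should be checked against the library.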
Of course, you could also create the SpiderOptions object directly, as discussed
earlier in the chapter. To set options directly, simply remove the call to the load method
and set each of the properties of the options object.
Recipes
This chapter includes four recipes. These recipes demonstrate how to construct spiders
that check links, download sites, and attempt to access a large number of sites.
Additionally, a recipe is provided that tracks the progress of a spider. Specifically, you
will see how to perform the following techniques:
• Find Site Broken Links
• Download Site Contents
• Access Numerous Internet Sites
• Track Spider Progress
These recipes will also show you how a bot can be adapted to perform several very
common spider techniques. The first recipe below demonstrates how to find bad links.
Recipe #13.1: Broken Links
A broken link is a link on a web site that leads to a non-existent page or image. Broken
links make a web site look unprofessional. Spiders are particularly adept at finding broken
links on a web site. This recipe shows how to create a spider that will scan a web site for
broken links.
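Before looking at the spider-based recipe, it may help to see what "broken" means for a single link. The sketch below is an assumption-laden illustration that uses only the standard java.net classes, not the Heaton Research Spider, and the recipe itself may detect broken links differently. The class name BrokenLinkCheck, the 5-second timeouts, and the rule that any status of 400 or above counts as broken are all choices made for this example.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

public class BrokenLinkCheck {

    /** Treats HTTP client and server errors (400 and above) as broken. */
    public static boolean isBrokenStatus(int status) {
        return status >= 400;
    }

    /**
     * Sends a HEAD request to the link and reports whether it looks broken.
     * Malformed links, non-HTTP links, and network failures count as broken.
     */
    public static boolean isBroken(String link) {
        HttpURLConnection conn = null;
        try {
            conn = (HttpURLConnection) URI.create(link).toURL().openConnection();
            conn.setRequestMethod("HEAD");   // headers only; no body download
            conn.setConnectTimeout(5000);    // assumed timeout, in milliseconds
            conn.setReadTimeout(5000);
            return isBrokenStatus(conn.getResponseCode());
        } catch (IOException | IllegalArgumentException | ClassCastException e) {
            // Unreachable host, malformed or relative URL, or a non-HTTP scheme.
            return true;
        } finally {
            if (conn != null) {
                conn.disconnect();
            }
        }
    }

    public static void main(String[] args) {
        // Pass one or more absolute URLs on the command line.
        for (String link : args) {
            System.out.println((isBroken(link) ? "BROKEN  " : "OK      ") + link);
        }
    }
}
```

Some servers reject HEAD requests even for pages that exist, so a stricter checker might retry with GET before declaring a link broken. A spider adds the part this sketch leaves out: discovering the links on each page and following them across the site.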