options.load("spider.conf");
Spider spider = new Spider(options,report);
spider.addURL(base, null, 1);
spider.process();
System.out.println(spider.getStatus());
First, a variable named base is created that contains the base URL with which the spider
begins. Next, a SimpleReport object named report is created. The SimpleReport class implements
a spider ReportableInterface and is provided by the Heaton Research Spider. However,
SimpleReport does nothing more than allow the spider to continue crawling; no data is
processed. It is suitable only for testing the spider.
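To make the idea of a do-nothing report concrete, here is a minimal, self-contained sketch. The DemoReport interface, its two methods, and the crawl loop are hypothetical stand-ins invented for illustration; they are not the actual Heaton Research Spider interface. The sketch only shows the pattern: a report that approves every URL but keeps no data, next to one that records what it is given.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class ReportDemo {
    // Hypothetical callback interface; NOT the real spider API.
    interface DemoReport {
        // Called for each link found; return true to let the crawl follow it.
        boolean foundURL(String url);

        // Called with the content of each downloaded page.
        void processPage(String url, String content);
    }

    // Plays the role described for SimpleReport: accept everything, keep nothing.
    static class NoOpReport implements DemoReport {
        public boolean foundURL(String url) {
            return true;
        }

        public void processPage(String url, String content) {
            // Intentionally empty: no data is processed.
        }
    }

    // For contrast, a report that remembers every page it was handed.
    static class CollectingReport implements DemoReport {
        final List<String> pages = new ArrayList<>();

        public boolean foundURL(String url) {
            return true;
        }

        public void processPage(String url, String content) {
            pages.add(url);
        }
    }

    // Stand-in for a crawl loop: offers each URL to the report.
    static int crawl(List<String> urls, DemoReport report) {
        int visited = 0;
        for (String url : urls) {
            if (report.foundURL(url)) {
                report.processPage(url, "<html>stub</html>");
                visited++;
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList("http://example.com/a", "http://example.com/b");
        CollectingReport collecting = new CollectingReport();
        System.out.println("No-op report visited: " + crawl(urls, new NoOpReport()));
        System.out.println("Collecting report visited: " + crawl(urls, collecting));
        System.out.println("Collected pages: " + collecting.pages);
    }
}
```

Both reports let the crawl visit every URL; only the collecting one has anything to show afterward, which is why a no-op report is useful for testing and little else.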
Next, a SpiderOptions object named options is created. The options object loads configuration
data from a file named spider.conf. Now that we have both a configuration object and a report
object, we can create a Spider object named spider.
Finally, the base URL is added to the spider object, and the process method is called. The
process method will not return until the spider is finished. If you wish to cancel the spider
processing early, you must call the cancel method on the spider object.
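Because process does not return until the work is done, a call to cancel has to come from somewhere other than the thread that is blocked inside process. The following sketch illustrates that pattern with a hypothetical BlockingWorker class standing in for the spider; the class, its fields, and the runWithTimeout helper are assumptions made for this example and are not part of the Heaton Research Spider.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicInteger;

class CancelDemo {
    // Hypothetical stand-in for a spider; not the real Spider class.
    static class BlockingWorker {
        private final AtomicBoolean cancelled = new AtomicBoolean(false);
        private final AtomicInteger processed = new AtomicInteger(0);
        private final int total;

        BlockingWorker(int total) {
            this.total = total;
        }

        // Blocks the caller until every item is handled or cancel() is called.
        void process() {
            while (processed.get() < total && !cancelled.get()) {
                try {
                    Thread.sleep(10); // pretend to download one page
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve the interrupt flag
                    return;
                }
                processed.incrementAndGet();
            }
        }

        // Safe to call from any thread.
        void cancel() {
            cancelled.set(true);
        }

        int getProcessed() {
            return processed.get();
        }
    }

    // Runs process() on the current thread while a watchdog thread cancels it
    // once timeoutMillis has elapsed.
    static int runWithTimeout(BlockingWorker worker, long timeoutMillis) {
        Thread watchdog = new Thread(() -> {
            try {
                Thread.sleep(timeoutMillis);
                worker.cancel();
            } catch (InterruptedException e) {
                // process() finished first, so there is nothing to cancel.
            }
        });
        watchdog.setDaemon(true);
        watchdog.start();
        worker.process();     // blocks here
        watchdog.interrupt(); // stop the watchdog if it is still waiting
        return worker.getProcessed();
    }

    public static void main(String[] args) {
        BlockingWorker worker = new BlockingWorker(1000);
        int done = runWithTimeout(worker, 100);
        System.out.println("Processed " + done + " of 1000 items before cancel");
    }
}
```

A watchdog timer is only one possible trigger; a user interface button or a shutdown hook could request the cancel in the same way.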
Of course, you could also create the SpiderOptions object directly, as discussed earlier in
the chapter. To set options directly, simply remove the call to the load method and set each
of the properties of the options object.
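The two configuration styles can be compared side by side in a small sketch. The DemoOptions class below is hypothetical: its field names (timeout, maxDepth, userAgent), its defaults, and its "key: value" text format are assumptions chosen for illustration, and the real SpiderOptions properties are the ones described earlier in the chapter. The example loads settings from text in one case and assigns the same values directly in the other.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class OptionsDemo {
    // Hypothetical options holder; field names and defaults are illustrative only.
    static class DemoOptions {
        int timeout = 60000;   // milliseconds
        int maxDepth = -1;     // -1 means "no limit" in this sketch
        String userAgent = "";

        // File-style configuration: parse "key: value" lines.
        void load(String text) {
            Properties p = new Properties();
            try {
                p.load(new StringReader(text));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            timeout = Integer.parseInt(p.getProperty("timeout", String.valueOf(timeout)));
            maxDepth = Integer.parseInt(p.getProperty("maxDepth", String.valueOf(maxDepth)));
            userAgent = p.getProperty("userAgent", userAgent);
        }

        String describe() {
            return "timeout=" + timeout + ", maxDepth=" + maxDepth + ", userAgent=" + userAgent;
        }
    }

    public static void main(String[] args) {
        // Style 1: load the settings from configuration text.
        DemoOptions loaded = new DemoOptions();
        loaded.load("timeout: 30000\nmaxDepth: 2\nuserAgent: DemoBot/1.0\n");

        // Style 2: skip load() and assign each property directly.
        DemoOptions direct = new DemoOptions();
        direct.timeout = 30000;
        direct.maxDepth = 2;
        direct.userAgent = "DemoBot/1.0";

        System.out.println(loaded.describe());
        System.out.println(direct.describe());
    }
}
```

Setting properties in code avoids a dependency on an external file, while loading from a file lets the settings change without recompiling.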
Recipes
This chapter includes four recipes. These recipes demonstrate how to construct spiders
that check links, download sites, and attempt to access a large number of sites. Additionally,
a recipe is provided that tracks the progress of a spider. Specifically, you will see how
to perform the following techniques:
• Find Site Broken Links
• Download Site Contents
• Access Numerous Internet Sites
• Track Spider Progress
These recipes will also show you how a bot can be adapted to perform several very common
spider techniques. The first recipe below demonstrates how to find bad links.
Recipe #13.1: Broken Links
A broken link is a link on a web site that leads to a non-existent page or image. Broken
links make a web site look unprofessional. Spiders are particularly adept at finding broken
links on a web site. This recipe shows how to create a spider that will scan a web site for
broken links.