INSIDE THE HEATON RESEARCH SPIDER - HTTP Programming Recipes for Java Bots

    }
    if ((href != null) &&
        !URLUtility.containsInvalidURLCharacters(href)) {
      if (!href.toLowerCase().startsWith("javascript:")
          && !href.toLowerCase().startsWith("rstp:")
          && !href.toLowerCase().startsWith("rtsp:")
          && !href.toLowerCase().startsWith("news:")
          && !href.toLowerCase().startsWith("irc:")
          && !href.toLowerCase().startsWith("mailto:")) {
        addURL(href, SpiderReportable.URLType.HYPERLINK);
      }
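The scheme test in the listing above can be condensed into a small standalone helper. The following is only a minimal sketch, not the book's code: the class name LinkFilter, the method isSpiderable, and the SKIPPED_SCHEMES array are hypothetical names introduced for illustration, and the URLUtility.containsInvalidURLCharacters check is left out because its implementation is not shown here.

```java
import java.util.Locale;

// Hypothetical helper, not part of the Heaton Research Spider.
class LinkFilter {
    // Scheme prefixes taken from the listing above. "rstp:" is kept
    // because the original code tests for it as well as "rtsp:".
    private static final String[] SKIPPED_SCHEMES = {
        "javascript:", "rstp:", "rtsp:", "news:", "irc:", "mailto:"
    };

    // Returns true when href is non-null and does not start with
    // one of the skipped scheme prefixes (case-insensitive).
    static boolean isSpiderable(String href) {
        if (href == null) {
            return false;
        }
        // Lower-case once instead of once per comparison.
        String lower = href.toLowerCase(Locale.ROOT);
        for (String scheme : SKIPPED_SCHEMES) {
            if (lower.startsWith(scheme)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isSpiderable("http://www.example.com/page.html"));
        System.out.println(isSpiderable("MAILTO:someone@example.com"));
    }
}
```

Keeping the prefixes in one array means a new scheme can be excluded by adding a single entry rather than another condition.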

Several methods make up the SpiderParseHTML class. They are discussed in the following sections.

Constructing a SpiderParseHTML Object

The constructor for the SpiderParseHTML class is relatively simple. It accepts three parameters (is, spider, and base) and uses them to initialize the object. As the following lines of code show, each of the instance variables is initialized in the constructor.

    super(is);
    this.stream = is;
    this.spider = spider;
    this.base = base;
    this.depth = spider.getWorkloadManager().getDepth(base);

Once the instance variables are initialized, the SpiderParseHTML object is ready for use.
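The constructor pattern described here, handing the stream to the parent class and then recording the collaborators, can be tried in isolation. The sketch below is an illustration under assumptions, not the Heaton Research code: DepthLookup and LinkParser are hypothetical stand-ins for the spider's workload manager and for SpiderParseHTML, java.io.FilterInputStream stands in for the ParseHTML parent class, and a java.net.URI is used for the base address.

```java
import java.io.ByteArrayInputStream;
import java.io.FilterInputStream;
import java.io.InputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the workload manager's depth lookup.
class DepthLookup {
    private final Map<URI, Integer> depths = new HashMap<>();

    void setDepth(URI url, int depth) {
        depths.put(url, depth);
    }

    // Returns the recorded depth, or 0 if the URL is unknown.
    int getDepth(URI url) {
        Integer depth = depths.get(url);
        return depth == null ? 0 : depth;
    }
}

// Hypothetical stand-in for SpiderParseHTML.
class LinkParser extends FilterInputStream {
    private final InputStream stream; // kept to mirror the original field
    private final DepthLookup lookup;
    private final URI base;
    private final int depth;

    LinkParser(URI base, InputStream is, DepthLookup lookup) {
        super(is);                           // the parent wraps the stream
        this.stream = is;
        this.lookup = lookup;
        this.base = base;
        this.depth = lookup.getDepth(base);  // depth is resolved once, up front
    }

    URI getBase() {
        return base;
    }

    int getDepth() {
        return depth;
    }
}

class ConstructorSketch {
    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com/");
        DepthLookup lookup = new DepthLookup();
        lookup.setDepth(base, 2);
        LinkParser parser =
            new LinkParser(base, new ByteArrayInputStream(new byte[0]), lookup);
        System.out.println(parser.getBase() + " is at depth " + parser.getDepth());
    }
}
```

Because the depth is looked up once in the constructor, later changes to the lookup table do not affect a parser that already exists.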

Reading Data from a SpiderParseHTML Object

The read method is called to read individual characters as the HTML is parsed. It works the same way as the read method of a regular ParseHTML object; the ParseHTML class was covered in Chapter 6, “Extracting Data”. The method begins by calling the parent's read method.

    int result = super.read();
    if (result == 0) {

If the parent's read method returns zero, a tag was found. The tag is then checked to see whether it is one of the tag types that can contain a link.

    HTMLTag tag = getTag();
    if (tag.getName().equalsIgnoreCase("a")) {
