}
if ((href != null) &&
    !URLUtility.containsInvalidURLCharacters(href)) {
  if (!href.toLowerCase().startsWith("javascript:")
      && !href.toLowerCase().startsWith("rstp:")
      && !href.toLowerCase().startsWith("rtsp:")
      && !href.toLowerCase().startsWith("news:")
      && !href.toLowerCase().startsWith("irc:")
      && !href.toLowerCase().startsWith("mailto:")) {
    addURL(href, SpiderReportable.URLType.HYPERLINK);
  }
}
}
}
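The scheme check above calls href.toLowerCase() once per prefix. As a minimal sketch of the same idea, the stand-alone class below keeps the skipped prefixes in a list and lowercases the link only once. The names HrefFilterSketch and isCrawlable are hypothetical and are not part of the spider; the URLUtility check for invalid URL characters is left out, and Locale.ROOT is an added assumption to keep the comparison independent of the default locale.

```java
import java.util.List;
import java.util.Locale;

// Hypothetical helper for illustration; not part of the spider in the text.
class HrefFilterSketch {

    // Prefixes copied from the listing above: links the spider does not follow.
    private static final List<String> SKIPPED_PREFIXES = List.of(
            "javascript:", "rstp:", "rtsp:", "news:", "irc:", "mailto:");

    // Returns true when href is non-null and starts with none of the skipped prefixes.
    static boolean isCrawlable(String href) {
        if (href == null) {
            return false;
        }
        // Lowercase once instead of once per comparison.
        String lower = href.toLowerCase(Locale.ROOT);
        for (String prefix : SKIPPED_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCrawlable("http://www.example.com/page.html")); // true
        System.out.println(isCrawlable("MAILTO:someone@example.com"));       // false
        System.out.println(isCrawlable("javascript:void(0)"));               // false
    }
}
```

Keeping the prefixes in one list means a new scheme can be excluded by adding a single entry rather than another condition.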
Several methods make up the SpiderParseHTML class. They are discussed in the following sections.
Constructing a SpiderParseHTML Object
The constructor for the SpiderParseHTML class is relatively simple. It accepts three parameters (is, spider, and base) and uses them to initialize the object. As the following lines of code show, each instance variable is initialized in the constructor, and the depth is looked up from the spider's workload manager.
super(is);
this.stream = is;
this.spider = spider;
this.base = base;
this.depth = spider.getWorkloadManager().getDepth(base);
Once the instance variables are initialized, the SpiderParseHTML object is ready for use.
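To see the constructor pattern in isolation, here is a minimal, self-contained sketch. It does not use the book's classes: Parser stands in for ParseHTML, and DepthTable stands in for the workload manager's getDepth call. Those names, the use of URI for the base address, and the default depth of zero are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URI;
import java.util.Map;

class ConstructorSketch {

    // Hypothetical stand-in for the parent parser class.
    static class Parser {
        protected final InputStream source;

        Parser(InputStream is) {
            this.source = is;
        }
    }

    // Hypothetical stand-in for the object that knows the depth of each URL.
    static class DepthTable {
        private final Map<URI, Integer> depths;

        DepthTable(Map<URI, Integer> depths) {
            this.depths = depths;
        }

        int getDepth(URI url) {
            return depths.getOrDefault(url, 0); // assumed default for unknown URLs
        }
    }

    // Mirrors the initialization order shown in the text.
    static class SpiderParser extends Parser {
        final InputStream stream;
        final DepthTable table;
        final URI base;
        final int depth;

        SpiderParser(URI base, InputStream is, DepthTable table) {
            super(is);                         // the parent is initialized first
            this.stream = is;
            this.table = table;
            this.base = base;
            this.depth = table.getDepth(base); // looked up once, at construction
        }
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com/");
        DepthTable table = new DepthTable(Map.of(base, 2));
        SpiderParser parser =
                new SpiderParser(base, new ByteArrayInputStream(new byte[0]), table);
        System.out.println(parser.depth); // prints 2
    }
}
```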
Reading Data from a SpiderParseHTML Object
The read method is called to read individual characters as the HTML is parsed, just as with a regular ParseHTML object. The ParseHTML class was covered in Chapter 6, “Extracting Data”. The read method begins by calling the parent's read method.
int result = super.read();
if (result == 0) {
If the read method returns zero, a tag was found. The tag is then checked to see whether it is one of the tag types that can contain a link.
HTMLTag tag = getTag();
if (tag.getName().equalsIgnoreCase("a")) {
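The read override follows a common pattern: delegate to the parent, then react when the parent returns a sentinel value. The sketch below imitates that pattern without the book's classes. TagReader, LinkReader, the toy tag detection, and the links list are all hypothetical; the real ParseHTML class does far more when it recognizes a tag.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ReadOverrideSketch {

    // Toy parent parser: read() returns 0 after consuming a whole tag,
    // -1 at end of input, and the character code otherwise.
    static class TagReader {
        private final String html;
        private int pos = 0;
        private String tagText = ""; // contents of the last tag, without < and >

        TagReader(String html) {
            this.html = html;
        }

        int read() {
            if (pos >= html.length()) {
                return -1;
            }
            char c = html.charAt(pos++);
            if (c != '<') {
                return c;
            }
            int end = html.indexOf('>', pos);
            if (end < 0) {
                end = html.length(); // unterminated tag: take the rest of the input
            }
            tagText = html.substring(pos, end);
            pos = Math.min(end + 1, html.length());
            return 0;
        }

        String getTagName() {
            return tagText.trim().split("\\s+")[0];
        }

        String getTagText() {
            return tagText;
        }
    }

    // Subclass that records the href of every <a> tag it passes over.
    static class LinkReader extends TagReader {
        private static final Pattern HREF = Pattern.compile(
                "href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
        final List<String> links = new ArrayList<>();

        LinkReader(String html) {
            super(html);
        }

        @Override
        int read() {
            int result = super.read(); // let the parent do the parsing
            if (result == 0 && getTagName().equalsIgnoreCase("a")) {
                Matcher m = HREF.matcher(getTagText());
                if (m.find()) {
                    links.add(m.group(1));
                }
            }
            return result; // callers still see exactly what the parent returned
        }
    }

    public static void main(String[] args) {
        LinkReader reader = new LinkReader("<p>See <A HREF=\"page2.html\">next</a></p>");
        while (reader.read() != -1) {
            // drain the input; links are collected as a side effect
        }
        System.out.println(reader.links); // prints [page2.html]
    }
}
```

Because the override returns the parent's result unchanged, code that already consumes the parser character by character keeps working while the links are gathered on the side.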