    }
    if ((href != null) &&
        !URLUtility.containsInvalidURLCharacters(href)) {
      if (!href.toLowerCase().startsWith("javascript:")
          && !href.toLowerCase().startsWith("rstp:")
          && !href.toLowerCase().startsWith("rtsp:")
          && !href.toLowerCase().startsWith("news:")
          && !href.toLowerCase().startsWith("irc:")
          && !href.toLowerCase().startsWith("mailto:")) {
        addURL(href, SpiderReportable.URLType.HYPERLINK);
      }
    }
  }
}
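The scheme checks in the fragment above lower-case the same string once per comparison. As a minimal sketch only, the same test could be written as a loop over a list of prefixes. The class name LinkSchemeFilter and the method isCrawlable are hypothetical and do not appear in the spider's source. The sketch also leaves out the URLUtility.containsInvalidURLCharacters check, which the original performs first.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the spider's source.
class LinkSchemeFilter {
    // Prefixes the fragment above declines to follow ("rstp:" is kept
    // only because the original listing checks for it as well).
    private static final String[] SKIPPED_PREFIXES = {
        "javascript:", "rstp:", "rtsp:", "news:", "irc:", "mailto:"
    };

    /** Returns true when href is non-null and starts with none of the skipped prefixes. */
    static boolean isCrawlable(String href) {
        if (href == null) {
            return false;
        }
        // Lower-case once, with a fixed locale, instead of once per comparison.
        String lower = href.toLowerCase(Locale.ROOT);
        for (String prefix : SKIPPED_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCrawlable("http://www.example.com/page.html"));
        System.out.println(isCrawlable("MailTo:someone@example.com"));
    }
}
```

Keeping the prefixes in one array means a new scheme can be skipped by adding a single entry rather than another clause in the condition.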
Several methods make up the SpiderParseHTML class. They are discussed in the
following sections.
Constructing a SpiderParseHTML Object
The constructor for the SpiderParseHTML class is relatively simple. It accepts
several parameters and uses them to initialize the object. As the following lines
of code show, each instance variable is initialized in the constructor.
super(is);
this.stream = is;
this.spider = spider;
this.base = base;
this.depth = spider.getWorkloadManager().getDepth(base);
Once the instance variables are initialized, the SpiderParseHTML object is ready
for use.
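The constructor follows a common subclass pattern: pass the stream to the parent first, then store the remaining arguments and look up any derived value once. The sketch below illustrates that pattern with simplified stand-in types. SketchWorkload, SketchBaseParser and SketchLinkParser are hypothetical names, the parameter order is an assumption, and java.net.URI is used in place of whatever type the real base variable has.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.net.URI;
import java.util.HashMap;
import java.util.Map;

// Stand-in for the workload manager: remembers a depth for each address.
class SketchWorkload {
    private final Map<URI, Integer> depths = new HashMap<>();

    void setDepth(URI address, int depth) {
        depths.put(address, depth);
    }

    // Returning -1 for an unknown address is an assumption of this sketch.
    int getDepth(URI address) {
        return depths.getOrDefault(address, -1);
    }
}

// Stand-in for the parent parser, which only needs the stream.
class SketchBaseParser {
    protected final InputStream source;

    SketchBaseParser(InputStream is) {
        this.source = is;
    }
}

class SketchLinkParser extends SketchBaseParser {
    private final InputStream stream;
    private final SketchWorkload workload;
    private final URI base;
    private final int depth;

    SketchLinkParser(SketchWorkload workload, InputStream is, URI base) {
        super(is);                             // the parent is initialized first
        this.stream = is;
        this.workload = workload;
        this.base = base;
        this.depth = workload.getDepth(base);  // looked up once, at construction
    }

    int getDepth() {
        return depth;
    }

    public static void main(String[] args) {
        SketchWorkload workload = new SketchWorkload();
        URI base = URI.create("http://www.example.com/");
        workload.setDepth(base, 2);
        InputStream empty = new ByteArrayInputStream(new byte[0]);
        System.out.println(new SketchLinkParser(workload, empty, base).getDepth());
    }
}
```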
Reading Data from a SpiderParseHTML Object
The read function is called to read individual characters as the HTML is parsed. It
works the same way as the read function of a regular ParseHTML object. The ParseHTML
class was covered in Chapter 6, “Extracting Data”. The read function begins by calling
the parent's read function.
int result = super.read();
if (result == 0) {
If the parent's read function returns zero, a tag was found. The tag is then checked
to see if it matches any of the tag types that contain a link.
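The zero-return convention can be illustrated with a small self-contained sketch. TinyTagReader is a simplified stand-in for a tag-aware parser and is not the book's ParseHTML class. LinkCollector, the regular expression and the collect helper are likewise hypothetical. The tag handling is deliberately naive: it treats everything between '<' and '>' as one tag and only recognizes double-quoted href values.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Simplified stand-in for a tag-aware parser.
class TinyTagReader {
    private final Reader in;
    private String tagText = "";   // text between '<' and '>' of the last tag

    TinyTagReader(Reader in) {
        this.in = in;
    }

    /** Returns the next character, 0 when a whole tag was consumed, or -1 at end of input. */
    int read() throws IOException {
        int c = in.read();
        if (c != '<') {
            return c;              // ordinary character, or -1 at end of input
        }
        StringBuilder sb = new StringBuilder();
        while ((c = in.read()) != -1 && c != '>') {
            sb.append((char) c);
        }
        tagText = sb.toString().trim();
        return 0;
    }

    String getTagName() {
        return tagText.split("\\s+")[0];
    }

    String getTagText() {
        return tagText;
    }
}

class LinkCollector extends TinyTagReader {
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*\"([^\"]*)\"", Pattern.CASE_INSENSITIVE);
    final List<String> links = new ArrayList<>();

    LinkCollector(Reader in) {
        super(in);
    }

    @Override
    int read() throws IOException {
        int result = super.read();   // let the parent do the parsing
        if (result == 0 && getTagName().equalsIgnoreCase("a")) {
            Matcher m = HREF.matcher(getTagText());
            if (m.find()) {
                links.add(m.group(1));
            }
        }
        return result;               // the caller sees the parent's value unchanged
    }

    /** Reads the whole input and returns the href values of its anchor tags. */
    static List<String> collect(String html) {
        LinkCollector parser = new LinkCollector(new StringReader(html));
        try {
            while (parser.read() != -1) {
                // characters and tags are consumed; links accumulate as a side effect
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return parser.links;
    }
}
```

Because the override returns the parent's result untouched, code that already reads from the parser does not need to change; link collection happens as a side effect of ordinary reading.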
HTMLTag tag = getTag();
if (tag.getName().equalsIgnoreCase("a")) {