String href = tag.getAttributeValue("href");
handleA(href);
} else if (tag.getName().equalsIgnoreCase("img")) {
String src = tag.getAttributeValue("src");
addURL(src, SpiderReportable.URLType.IMAGE);
} else if (tag.getName().equalsIgnoreCase("style")) {
String src = tag.getAttributeValue("src");
addURL(src, SpiderReportable.URLType.STYLE);
} else if (tag.getName().equalsIgnoreCase("link")) {
String href = tag.getAttributeValue("href");
addURL(href, SpiderReportable.URLType.SCRIPT);
} else if (tag.getName().equalsIgnoreCase("base")) {
String href = tag.getAttributeValue("href");
this.base = new URL(this.base, href);
}
}
return result;
For most tag types, the addURL method will be called. However, the anchor tag is handled differently, with a call to the handleA method.
Adding a URL
The addURL method is called to add a URL. It begins by rejecting any null URLs.
if (u == null) {
return;
}
The URL is then converted to its fully qualified form. For example, if an href of “images/me.gif” were found on a page under http://www.httprecipes.com/1/, the fully qualified URL would be http://www.httprecipes.com/1/images/me.gif.
try {
URL url = URLUtility.constructURL(this.base, u, true);
url = this.spider.getWorkloadManager().convertURL(url.toString());
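The resolution step can be tried in isolation. The following is a minimal sketch that uses the standard java.net.URI class as a stand-in for URLUtility.constructURL, whose implementation is not shown here. The class name ResolveSketch and the page name index.php are assumptions made for illustration only.

```java
import java.net.URI;

class ResolveSketch {
    // Resolves a possibly relative reference against a base URL.
    // This is a stand-in for URLUtility.constructURL, not its actual code.
    static String resolve(String base, String ref) {
        return URI.create(base).resolve(ref).toString();
    }

    public static void main(String[] args) {
        // Hypothetical page that contains the relative href.
        String base = "http://www.httprecipes.com/1/index.php";
        // prints http://www.httprecipes.com/1/images/me.gif
        System.out.println(resolve(base, "images/me.gif"));
    }
}
```

A reference that is already absolute is returned unchanged, so the same call can be applied to every URL found on a page.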
Next, the protocol is checked. If the URL's protocol is anything other than http or https, the URL is ignored.
if (url.getProtocol().equalsIgnoreCase("http")
|| url.getProtocol().equalsIgnoreCase("https")) {
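The effect of this check can be seen with a small standalone sketch. It assumes the URL is available as a string and uses java.net.URI to read the scheme. The class name ProtocolFilterSketch and the method name isSpiderable are hypothetical and are not part of the spider.

```java
import java.net.URI;

class ProtocolFilterSketch {
    // Returns true only for http and https URLs, compared case-insensitively.
    static boolean isSpiderable(String url) {
        String scheme = URI.create(url).getScheme();
        return scheme != null
                && (scheme.equalsIgnoreCase("http") || scheme.equalsIgnoreCase("https"));
    }

    public static void main(String[] args) {
        System.out.println(isSpiderable("http://www.httprecipes.com/1/images/me.gif")); // prints true
        System.out.println(isSpiderable("mailto:someone@example.com"));                 // prints false
        System.out.println(isSpiderable("ftp://example.com/file.txt"));                 // prints false
    }
}
```

Links such as mailto: or ftp: addresses are common on real pages, which is why a filter of this kind runs before a URL is handed to the workload.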
The spiderFoundURL method is then called to determine whether the URL should be added. If it should, the spider's addURL method is called.
if (this.spider.getReport().spiderFoundURL(url, this.base, type))
{
try {