Java Reference
In-Depth Information
    this.spider.addURL(url, this.base, this.depth + 1);
} catch (WorkloadException e) {
    throw new IOException(e.getMessage());
}
Some URLs require additional processing. The anchor tag is discussed in the next section.
Adding an Anchor URL
The href value of an anchor tag sometimes begins with a prefix such as “javascript:”.
Such values are not web addresses and cannot be parsed through the URL class. The
handleA method was designed to take care of these prefixes. It begins by trimming
the href value.
if (href != null) {
    href = href.trim();
}
If the href starts with any of the following known prefixes, it is ignored. Otherwise,
the URL is added to the spider's workload.
if ((href != null) &&
        !URLUtility.containsInvalidURLCharacters(href)) {
    if (!href.toLowerCase().startsWith("javascript:")
            && !href.toLowerCase().startsWith("rstp:")
            && !href.toLowerCase().startsWith("rtsp:")
            && !href.toLowerCase().startsWith("news:")
            && !href.toLowerCase().startsWith("irc:")
            && !href.toLowerCase().startsWith("mailto:")) {
        addURL(href, SpiderReportable.URLType.HYPERLINK);
    }
}
This allows non-standard URLs to be ignored without throwing a MalformedURLException.
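The prefix check above can also be written as a loop over a table of prefixes, which may be easier to extend. The following is only a minimal sketch: the class name HrefFilter and the method isCrawlable are hypothetical and are not part of the spider, and the sketch leaves out the URLUtility.containsInvalidURLCharacters test. The prefix list itself is copied from the handleA code.

```java
import java.util.Locale;

// Hypothetical helper, not part of the book's spider. It reproduces only
// the prefix test from handleA, using a table instead of chained && checks.
class HrefFilter {
    // Prefixes copied from the handleA check shown in the text.
    private static final String[] IGNORED_PREFIXES = {
        "javascript:", "rstp:", "rtsp:", "news:", "irc:", "mailto:"
    };

    // Returns true when the href does not start with an ignored prefix.
    static boolean isCrawlable(String href) {
        if (href == null) {
            return false;
        }
        // Trim and lower-case once, rather than once per comparison.
        String normalized = href.trim().toLowerCase(Locale.ROOT);
        for (String prefix : IGNORED_PREFIXES) {
            if (normalized.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {"index.html", " JavaScript:void(0)", "mailto:a@example.com"};
        for (String href : samples) {
            System.out.println(href.trim() + " -> " + isCrawlable(href));
        }
    }
}
```

Lower-casing with Locale.ROOT is a choice made for this sketch so that the comparison does not depend on the default locale of the machine running the spider.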
Spider Input Stream
The spider also includes an InputStream derived class named
SpiderInputStream. This stream works just like a regular InputStream, except
that it holds an OutputStream. This OutputStream is sent a copy of everything
read by the SpiderInputStream. This allows the raw HTML to be written out to a file
as it is parsed. The SpiderInputStream is shown in Listing 14.5.
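Before turning to the full listing, the copy-on-read idea described above can be illustrated with a small stand-alone sketch. The class name TeeInputStream is hypothetical and this is not the book's SpiderInputStream, which may differ in its details. It only shows one way an InputStream can forward every byte it reads to an OutputStream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical illustration only; not the SpiderInputStream from the book.
class TeeInputStream extends InputStream {
    private final InputStream source; // stream actually being read
    private final OutputStream copy;  // receives a copy of every byte read

    TeeInputStream(InputStream source, OutputStream copy) {
        this.source = source;
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = source.read();
        if (b != -1) {
            copy.write(b); // forward the byte unless end of stream was hit
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = source.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n); // forward only the bytes actually read
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        // Assumption for this sketch: the caller owns and closes the copy stream.
        source.close();
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream log = new ByteArrayOutputStream();
        byte[] html = "<html></html>".getBytes("UTF-8");
        try (InputStream in = new TeeInputStream(new ByteArrayInputStream(html), log)) {
            while (in.read() != -1) {
                // A parser would consume the bytes here.
            }
        }
        System.out.println(log.toString("UTF-8"));
    }
}
```

In the spider, the copy stream would typically be a file stream, so the page is saved while the parser consumes it and no second download is needed.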
Listing 14.5: Spider Input Stream (SpiderInputStream.java)
package com.heatonresearch.httprecipes.spider;
import java.io.*;