    this.spider.addURL(url, this.base, this.depth + 1);
} catch (WorkloadException e) {
    throw new IOException(e.getMessage());
}
Some URLs require additional processing. The anchor tag is discussed in the next section.
Adding an Anchor URL
Anchor tags sometimes have a prefix such as “javascript:”. These are not web addresses and cannot be parsed through the URL class. The handleA method was designed to take care of these prefixes. The handleA method begins by trimming the href value.
if (href != null) {
    href = href.trim();
}
If the URL has any of the following known prefixes, then it will be ignored. Otherwise,
the URL will be added to the spider's workload.
if ((href != null) &&
        !URLUtility.containsInvalidURLCharacters(href)) {
    if (!href.toLowerCase().startsWith("javascript:")
            && !href.toLowerCase().startsWith("rtsp:")
            && !href.toLowerCase().startsWith("news:")
            && !href.toLowerCase().startsWith("irc:")
            && !href.toLowerCase().startsWith("mailto:")) {
        addURL(href, SpiderReportable.URLType.HYPERLINK);
    }
}
This allows non-standard URLs to be ignored without throwing a MalformedURLException.
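The prefix check can also be tried on its own, outside the spider. The following is a minimal sketch, not the spider's actual code: the AnchorFilter class and its shouldFollow method are hypothetical names introduced for illustration, and the sketch leaves out the URLUtility.containsInvalidURLCharacters test because that class is not reproduced here.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the spider's source.
final class AnchorFilter {
    // Prefixes that do not name a page the spider can fetch,
    // based on the prefixes checked by handleA.
    private static final String[] IGNORED_PREFIXES = {
        "javascript:", "rtsp:", "news:", "irc:", "mailto:"
    };

    private AnchorFilter() {
    }

    // Returns true if the href should be added to the workload.
    static boolean shouldFollow(String href) {
        if (href == null) {
            return false;
        }
        // Trim and lower-case once, rather than once per prefix.
        String lower = href.trim().toLowerCase(Locale.ROOT);
        for (String prefix : IGNORED_PREFIXES) {
            if (lower.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "http://www.example.com/",
            " JavaScript:void(0)",
            "mailto:someone@example.com",
            "page2.html"
        };
        for (String href : samples) {
            System.out.println(href.trim() + " -> " + shouldFollow(href));
        }
    }
}
```

Keeping the prefixes in one array means a new prefix can be added without touching the condition itself.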
Spider Input Stream
The spider also includes an InputStream-derived class named SpiderInputStream. This stream works just like a regular InputStream, except that it holds an OutputStream. This OutputStream is sent a copy of everything read by the SpiderInputStream. This allows the raw HTML to be written out to a file as it is parsed. The SpiderInputStream is shown in Listing 14.5.
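The copy-on-read idea can be sketched independently of the listing. The class below is a hypothetical illustration named TeeInputStream, not the book's SpiderInputStream: it assumes the copy is written only for bytes actually read, and that closing the stream closes the source but leaves the OutputStream open for the caller to manage.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; not the book's SpiderInputStream.
final class TeeInputStream extends InputStream {
    private final InputStream source;
    private final OutputStream copy;

    TeeInputStream(InputStream source, OutputStream copy) {
        this.source = source;
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = source.read();
        if (b != -1) {
            copy.write(b); // mirror the single byte that was read
        }
        return b;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        int n = source.read(buf, off, len);
        if (n > 0) {
            copy.write(buf, off, n); // mirror only the bytes actually read
        }
        return n;
    }

    @Override
    public void close() throws IOException {
        source.close(); // the copy stream is left for the caller to close
    }

    public static void main(String[] args) throws IOException {
        byte[] html = "<html><body>Hello</body></html>"
                .getBytes(StandardCharsets.UTF_8);
        ByteArrayOutputStream saved = new ByteArrayOutputStream();
        try (InputStream in =
                new TeeInputStream(new ByteArrayInputStream(html), saved)) {
            while (in.read() != -1) {
                // a parser would consume the bytes here
            }
        }
        System.out.println(new String(saved.toByteArray(), StandardCharsets.UTF_8));
    }
}
```

In a spider, the ByteArrayOutputStream would typically be replaced by a FileOutputStream so that the page is saved while the parser reads it.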
Listing 14.5: Spider Input Stream (SpiderInputStream.java)
package com.heatonresearch.httprecipes.spider;
import java.io.*;