INSIDE THE HEATON RESEARCH SPIDER - HTTP Programming Recipes for Java Bots


    this.spider.addURL(url, this.base, this.depth + 1);
} catch (WorkloadException e) {
    throw new IOException(e.getMessage());
}

Some URLs require additional processing. The anchor tag is discussed in the next section.

Adding an Anchor URL

The href value of an anchor tag sometimes begins with a prefix such as “javascript:”. Such values are not web addresses and cannot be parsed through the URL class. The handleA method was designed to take care of these prefixes. It begins by trimming the href value.

if (href != null) {
    href = href.trim();
}

If the href is null, contains invalid URL characters, or starts with any of the following known prefixes, it is ignored. Otherwise, the URL is added to the spider's workload.

if ((href != null) &&
    !URLUtility.containsInvalidURLCharacters(href)) {
    if (!href.toLowerCase().startsWith("javascript:")
        && !href.toLowerCase().startsWith("rstp:")
        && !href.toLowerCase().startsWith("rtsp:")
        && !href.toLowerCase().startsWith("news:")
        && !href.toLowerCase().startsWith("irc:")
        && !href.toLowerCase().startsWith("mailto:")) {
        addURL(href, SpiderReportable.URLType.HYPERLINK);
    }
}

This allows non-standard URLs to be ignored without throwing a MalformedURLException.
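The prefix check can also be written as a loop over a list of prefixes. The following is only a minimal sketch of that idea, not the spider's own code: the class name AnchorPrefixFilter and the method isCrawlable are hypothetical, and the sketch leaves out the URLUtility.containsInvalidURLCharacters test that the excerpt performs first.

```java
import java.util.Locale;

// Hypothetical helper for illustration; not part of the Heaton Research spider.
final class AnchorPrefixFilter {
    // The same prefixes that the handleA excerpt checks for.
    private static final String[] IGNORED_PREFIXES = {
        "javascript:", "rstp:", "rtsp:", "news:", "irc:", "mailto:"
    };

    // Returns true when the href does not start with any ignored prefix.
    static boolean isCrawlable(String href) {
        if (href == null) {
            return false;
        }
        // Trim and lowercase once instead of once per prefix.
        String candidate = href.trim().toLowerCase(Locale.ROOT);
        for (String prefix : IGNORED_PREFIXES) {
            if (candidate.startsWith(prefix)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCrawlable(" JavaScript:void(0) "));
        System.out.println(isCrawlable("page2.html"));
    }
}
```

Keeping the prefixes in one array makes it easier to add another scheme later without lengthening the condition.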

Spider Input Stream

The spider also includes an InputStream-derived class named SpiderInputStream. This stream works just like a regular InputStream, except that it holds an OutputStream. This OutputStream is sent a copy of everything read by the SpiderInputStream. This allows the raw HTML to be written out to a file as it is parsed. The SpiderInputStream is shown in Listing 14.5.
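Before turning to the listing, the general idea of a stream that copies what it reads can be sketched independently of the book's code. The class below is an assumption made for illustration only: TeeInputStreamSketch and its roundTrip helper are hypothetical names, and the real SpiderInputStream may differ in its constructor and details.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the book's SpiderInputStream.
final class TeeInputStreamSketch extends InputStream {
    private final InputStream source; // stream that is actually read
    private final OutputStream copy;  // receives a copy of every byte read

    TeeInputStreamSketch(InputStream source, OutputStream copy) {
        this.source = source;
        this.copy = copy;
    }

    @Override
    public int read() throws IOException {
        int b = source.read();
        if (b != -1) {
            copy.write(b); // mirror the single byte
        }
        return b;
    }

    @Override
    public int read(byte[] buffer, int offset, int length) throws IOException {
        int count = source.read(buffer, offset, length);
        if (count > 0) {
            copy.write(buffer, offset, count); // mirror only the bytes read
        }
        return count;
    }

    @Override
    public void close() throws IOException {
        source.close(); // the caller stays responsible for the copy stream
    }

    // Reads the text through the tee and returns what the copy received.
    static String roundTrip(String html) {
        ByteArrayOutputStream copy = new ByteArrayOutputStream();
        byte[] input = html.getBytes(StandardCharsets.UTF_8);
        try (InputStream in =
                 new TeeInputStreamSketch(new ByteArrayInputStream(input), copy)) {
            byte[] buffer = new byte[8];
            while (in.read(buffer, 0, buffer.length) != -1) {
                // Reading is enough; the tee writes the copy as a side effect.
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(copy.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

In a spider, the copy stream would typically be a FileOutputStream, so the page is saved to disk while the parser consumes it.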

Listing 14.5: Spider Input Stream (SpiderInputStream.java)

package com.heatonresearch.httprecipes.spider;

import java.io.*;
