INSIDE THE HEATON RESEARCH SPIDER - HTTP Programming Recipes for Java Bots


Spider HTML Parsing

The Heaton Research Spider passes a SpiderParseHTML object to the
spiderProcessURL method of a SpiderReportable object. This object allows
the HTML found by the spider to be parsed. At the same time, it allows the
spider to extract links from the HTML. The SpiderParseHTML class is shown
in Listing 14.4.
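The link-extraction idea can be illustrated with a small, self-contained sketch. This is not the code of the spider's parsing class. The names LinkExtractionSketch, FoundLink and extractLinks are hypothetical, a regular expression stands in for a real HTML parser, and java.net.URI is used to resolve each link against the base address of the page. Incrementing the depth by one for each discovered link is also an assumption made for the example.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LinkExtractionSketch {

    /** Hypothetical holder for one discovered link; not part of the spider's API. */
    public static final class FoundLink {
        public final URI target; // absolute address of the linked page
        public final int depth;  // assumed: one more than the page it was found on

        FoundLink(URI target, int depth) {
            this.target = target;
            this.depth = depth;
        }
    }

    // Assumption: a simple pattern for quoted href values stands in for a real
    // HTML parser. The character class stops at a quote or '#', so fragments
    // are dropped and fragment-only links such as "#top" are not matched.
    private static final Pattern HREF =
            Pattern.compile("href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /**
     * Collects the links in the given HTML, resolved against the base address.
     */
    public static List<FoundLink> extractLinks(String html, URI base, int depth) {
        List<FoundLink> found = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                // Relative links become absolute; absolute links are kept as they are.
                URI target = base.resolve(m.group(1).trim());
                found.add(new FoundLink(target, depth + 1));
            } catch (IllegalArgumentException e) {
                // Skip href values that are not valid URIs.
            }
        }
        return found;
    }

    public static void main(String[] args) {
        URI base = URI.create("http://www.example.com/docs/index.html");
        String html = "<a href=\"page2.html\">Next</a> <a href='/about'>About</a>";
        for (FoundLink link : extractLinks(html, base, 1)) {
            System.out.println(link.depth + " " + link.target);
        }
    }
}
```

Resolving against the base address is what lets a spider follow relative links such as page2.html, which is presumably why a parser of this kind needs to know the URL of the page it is reading.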

Listing 14.4: HTML Parsing (SpiderParseHTML.java)

package com.heatonresearch.httprecipes.spider;

import java.io.*;
import java.net.*;
import java.util.logging.*;
import com.heatonresearch.httprecipes.html.*;
import com.heatonresearch.httprecipes.spider.workload.*;

public class SpiderParseHTML extends ParseHTML {
  /**
   * The logger.
   */
  private static Logger logger = Logger
      .getLogger("com.heatonresearch.httprecipes.spider.SpiderParseHTML");

  /**
   * The Spider that this page is being parsed for.
   */
  private Spider spider;

  /**
   * The URL that is being parsed.
   */
  private URL base;

  /**
   * The depth of the page being parsed.
   */
  private int depth;

  /**
   * The InputStream that is being parsed.
   */
  private SpiderInputStream stream;
