EXTRACTING DATA - HTTP Programming Recipes for Java Bots

Parsing HTML

The ParseHTML class handles HTML parsing. It is used by every recipe in this
chapter, and many recipes in the remainder of the book use it as well. This
section begins by showing how to use the ParseHTML class. A later section shows
how the ParseHTML class is implemented.

Using ParseHTML

The ParseHTML class is easy to use. The following code fragment demonstrates
its basic read loop.

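// url is assumed to be a java.net.URL for the page to parse
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
int ch;
while ((ch = parse.read()) != -1)
{
  if (ch == 0)
  {
    // read() returned 0: a tag was found, so fetch it
    HTMLTag tag = parse.getTag();
    System.out.println("Read HTML tag: " + tag);
  }
  else
  {
    // any other value is an ordinary text character
    System.out.println("Read HTML text character: "
      + ((char)ch) );
  }
}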

As you can see from the above code, an InputStream is acquired from a URL and
then used to construct a ParseHTML object. The ParseHTML class can parse HTML
from any InputStream object.
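Because the constructor accepts any InputStream, the HTML does not have to come from the network. The sketch below uses only standard JDK classes to show two other sources, an in-memory string and a saved file. The names HtmlSources, fromString, and fromFile are hypothetical and are not part of the book's code. ParseHTML itself is left out so that the example runs on its own.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class: different ways to obtain an InputStream of HTML.
class HtmlSources {
    // In-memory HTML, handy for trying out a parser without a network connection.
    static InputStream fromString(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    // HTML stored on disk, for example a page that was captured earlier.
    static InputStream fromFile(Path path) throws IOException {
        return Files.newInputStream(path);
    }

    public static void main(String[] args) throws IOException {
        try (InputStream is = fromString("<html><body>Hello</body></html>")) {
            // This stream could be handed to the ParseHTML constructor
            // in place of url.openStream().
            System.out.println(is.available() + " bytes ready to parse");
        }
    }
}
```

Feeding the parser from a string like this is one way to exercise parsing code repeatedly without fetching a live page each time.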

Next, the code enters a loop that calls parse.read(). When parse.read()
returns -1, there is nothing more to parse and the loop ends. If parse.read()
returns 0, an HTML tag was encountered, and you can call parse.getTag() to
determine which tag it was.

If the value returned is neither -1 nor 0, a regular character was found in
the HTML. This process continues until there is nothing left to read from the
HTML file.
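The return-value contract described above can be tried without the book's class. The following is a minimal stand-in and not the real ParseHTML implementation. MiniTagReader, MiniTagReaderDemo, and summarize are hypothetical names. The stand-in returns each tag as a raw string instead of an HTMLTag object, assumes ASCII input with no NUL bytes, and ignores comments, scripts, and character entities.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in that only mimics the read()/getTag() contract.
class MiniTagReader {
    private final InputStream in;
    private String tag; // raw text of the most recent tag, without < and >

    MiniTagReader(InputStream in) {
        this.in = in;
    }

    // Returns -1 at end of input, 0 when a tag was read,
    // and otherwise the text character that was read.
    int read() throws IOException {
        int ch = in.read();
        if (ch != '<') {
            return ch; // -1 or an ordinary character
        }
        // Collect everything up to the closing '>' (or end of input).
        StringBuilder sb = new StringBuilder();
        while ((ch = in.read()) != -1 && ch != '>') {
            sb.append((char) ch);
        }
        tag = sb.toString();
        return 0;
    }

    String getTag() {
        return tag;
    }
}

class MiniTagReaderDemo {
    // Runs the same loop as the fragment and records what it saw.
    static String summarize(String html) {
        InputStream is =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII));
        MiniTagReader parse = new MiniTagReader(is);
        StringBuilder out = new StringBuilder();
        try {
            int ch;
            while ((ch = parse.read()) != -1) {
                if (ch == 0) {
                    out.append("[tag:").append(parse.getTag()).append(']');
                } else {
                    out.append((char) ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: [tag:p]Hi [tag:b]bot[tag:/b][tag:/p]
        System.out.println(summarize("<p>Hi <b>bot</b></p>"));
    }
}
```

Using 0 as the "tag found" signal keeps read() a single int-returning method, at the cost of reserving one value that can never be reported as a text character.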

This is only a basic example of using ParseHTML. The recipes in this chapter
build on it considerably.
