Parsing HTML
The ParseHTML class handles HTML parsing. It is used by every recipe in this chapter, and many later recipes rely on it as well. I will begin by showing you how to use the ParseHTML class. A later section will show you how the ParseHTML class was implemented.
Using ParseHTML
The ParseHTML class is very easy to use. The following code fragment demonstrates how to make use of it.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
int ch;
// read() returns -1 at end of input and 0 when a tag has been parsed
while ((ch = parse.read()) != -1)
{
    if (ch == 0)
    {
        HTMLTag tag = parse.getTag();
        System.out.println("Read HTML tag: " + tag);
    }
    else
    {
        System.out.println("Read HTML text character: "
            + ((char)ch) );
    }
}
As you can see from the above code, an InputStream is acquired from a URL. This InputStream is used to construct a ParseHTML object. The ParseHTML class can parse HTML from any InputStream object.
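Because the constructor accepts any InputStream, the HTML does not have to come from the network. The following is a minimal sketch, using only standard library classes, that wraps an in-memory string in a ByteArrayInputStream. The class name HtmlSources and the method fromString are hypothetical names chosen for illustration; they are not part of the book's code.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, used here only for illustration.
final class HtmlSources {
    // Wraps an HTML string in an InputStream backed by its UTF-8 bytes.
    static InputStream fromString(String html) {
        return new ByteArrayInputStream(html.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        InputStream is = fromString("<p>Hello</p>");
        // 'is' could be handed to new ParseHTML(is) in place of url.openStream().
        System.out.println("Bytes available: " + is.available());
    }
}
```

A stream built this way can be convenient for trying out parsing code without a network connection.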
Next, the code enters a loop that calls parse.read(). Once parse.read() returns -1, there is nothing more to parse, and the loop ends. If parse.read() returns 0, then an HTML tag was encountered. You can call parse.getTag() to determine which tag was encountered.
If neither -1 nor 0 is returned, then a regular character has been found in the HTML, and the return value is that character. This process continues until there is nothing else to read from the HTML file.
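The three-way return value described above can be illustrated without the real class. The following is a minimal sketch, assuming a simplified stand-in named MiniTagReader (a hypothetical class, not the book's ParseHTML). It treats everything between '<' and '>' as a tag, returns the tag text as a plain String rather than an HTMLTag, and ignores comments, entities, and character encodings beyond ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in, NOT the book's ParseHTML. It only illustrates
// the read() contract: -1 = end of input, 0 = tag, otherwise a character.
final class MiniTagReader {
    private final InputStream in;
    private String lastTag = "";

    MiniTagReader(InputStream in) {
        this.in = in;
    }

    // Returns -1 at end of input, 0 when a tag was consumed,
    // or the character that was read.
    int read() throws IOException {
        int ch = in.read();
        if (ch != '<') {
            return ch; // ordinary character, or -1 at end of input
        }
        StringBuilder tag = new StringBuilder();
        // Collect everything up to the closing '>' (or end of input).
        while ((ch = in.read()) != -1 && ch != '>') {
            tag.append((char) ch);
        }
        lastTag = tag.toString();
        return 0;
    }

    // Text of the most recently read tag, without the angle brackets.
    String getTag() {
        return lastTag;
    }

    // Runs the read loop over a string and returns a compact trace of events.
    static String trace(String html) {
        MiniTagReader parse = new MiniTagReader(
                new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII)));
        StringBuilder out = new StringBuilder();
        try {
            int ch;
            while ((ch = parse.read()) != -1) {
                if (ch == 0) {
                    out.append("[tag:").append(parse.getTag()).append(']');
                } else {
                    out.append((char) ch);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace("<b>Hi</b>")); // prints [tag:b]Hi[tag:/b]
    }
}
```

The loop inside trace has the same shape as the fragment shown earlier: test for -1 to stop, test for 0 to fetch the tag, and otherwise treat the value as a character.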
This is only a basic example of using ParseHTML. The recipes for this chapter will expand on this greatly.