Java Reference
There are also several third-party HTML parsers available. However, a simple, lightweight
HTML parser is not hard to write. The aim here is to present many small examples of HTTP
programming that readers can implement in their own programs. As a result, we will create
our own HTML parser.
The HTML parser presented in this chapter is implemented in three classes, and all of the
recipes in this chapter use it. Before getting to the recipes, the next few sections examine
how the parser works. If you are not interested in how to implement an HTML parser, you can
skip to the recipes section of this chapter. You do not need to know how the parser is
implemented in order to make use of it.
Peekable InputStream
To properly parse any data, let alone HTML, it is very convenient to have a peekable
stream. A peekable stream is a regular Java InputStream, except that you can peek several
characters ahead before actually reading them. First we will examine why
PeekableInputStream is so convenient.
Consider parsing the following line of HTML.
<b>Hello World</b>
The first thing we would like to know is whether we are parsing an HTML tag or HTML text.
Using the PeekableInputStream, we can look at the first character and determine whether we
are starting with a tag or with text. Once we know that we are parsing text, we can begin
reading the actual text characters.
The PeekableInputStream class is also very useful for HTML comments. Consider the
following HTML comment:
<!--HTML Comment-->
To determine whether something is an HTML comment, you must look at the first four
characters of the tag (<!--). Using the PeekableInputStream, we can examine the next four
characters and see whether we are about to read a comment.
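The look-ahead decisions described above can be sketched with the standard library alone.
The example below is only an illustration, not the chapter's PeekableInputStream: it
substitutes java.io.PushbackInputStream, which lets bytes be read and then pushed back, to
get the same effect. The class name PeekDemo and the helper methods peek and classify are
hypothetical names chosen for this sketch, and the input is assumed to be plain ASCII.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; it stands in for the chapter's PeekableInputStream
// by using the standard PushbackInputStream instead.
class PeekDemo {

    // Returns up to n upcoming bytes as a string without consuming them:
    // the bytes are read and then immediately pushed back onto the stream.
    static String peek(PushbackInputStream in, int n) throws IOException {
        byte[] buf = new byte[n];
        int count = 0;
        while (count < n) {
            int r = in.read(buf, count, n - count);
            if (r < 0) {
                break; // end of stream reached before n bytes were available
            }
            count += r;
        }
        if (count > 0) {
            in.unread(buf, 0, count);
        }
        return new String(buf, 0, count, StandardCharsets.US_ASCII);
    }

    // Decides what the input starts with: a comment, a tag, or plain text.
    static String classify(String html) {
        byte[] bytes = html.getBytes(StandardCharsets.US_ASCII);
        // A pushback buffer of 4 bytes is enough to hold the "<!--" look-ahead.
        try (PushbackInputStream in =
                 new PushbackInputStream(new ByteArrayInputStream(bytes), 4)) {
            if (peek(in, 4).equals("<!--")) {
                return "comment";
            }
            if (peek(in, 1).equals("<")) {
                return "tag";
            }
            return "text";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(classify("<b>Hello World</b>"));
        System.out.println(classify("<!--HTML Comment-->"));
        System.out.println(classify("Hello World"));
    }
}
```

The comment check has to come before the tag check, because a comment also begins with
the < character.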
Using PeekableInputStream
Using the PeekableInputStream is very simple; its usage closely follows that of the Java
class InputStream. To use PeekableInputStream, you must already have an InputStream. You
then attach the PeekableInputStream to the existing InputStream. The following code
demonstrates this.
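As a rough sketch of what attaching a peekable stream to an existing InputStream might
look like, the class below wraps any InputStream and buffers the bytes that have been
peeked but not yet read. It is an assumption made for illustration, not the chapter's
actual class: the name SketchPeekableInputStream, the peek(int depth) method, and the demo
helper are all hypothetical, and the real PeekableInputStream API may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal peekable stream; not the chapter's implementation.
class SketchPeekableInputStream extends InputStream {
    private final InputStream in;                            // the wrapped stream
    private final List<Integer> peeked = new ArrayList<>();  // peeked, unread bytes

    SketchPeekableInputStream(InputStream in) {
        this.in = in;
    }

    // Returns the byte `depth` positions ahead (0 is the next byte) without
    // consuming it, or -1 if the stream ends before that position.
    int peek(int depth) throws IOException {
        while (peeked.size() <= depth) {
            int b = in.read();
            if (b < 0) {
                return -1;
            }
            peeked.add(b);
        }
        return peeked.get(depth);
    }

    // Reads the next byte, draining previously peeked bytes first.
    @Override
    public int read() throws IOException {
        return peeked.isEmpty() ? in.read() : peeked.remove(0);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }

    // Peeks at the first four bytes, then reads the whole stream, to show
    // that peeking consumed nothing. Assumes ASCII input.
    static String demo(String html) {
        InputStream source =
            new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII));
        // Attach the peekable stream to the existing InputStream.
        try (SketchPeekableInputStream peekable = new SketchPeekableInputStream(source)) {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 4; i++) {
                int b = peekable.peek(i);
                if (b >= 0) {
                    sb.append((char) b);
                }
            }
            sb.append('|');
            int b;
            while ((b = peekable.read()) >= 0) {
                sb.append((char) b);
            }
            return sb.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo("<!--HTML Comment-->"));
    }
}
```

Because the sketch extends InputStream, code that only needs ordinary reads can treat it
like any other stream, while a parser can call peek before deciding how to proceed.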