The ParseHTML class uses three variables to track HTML parsing: two instance
variables and one static variable that is shared by all instances. These variables
are shown here.
private PeekableInputStream source;
private HTMLTag tag;
private static Map<String, Character> charMap;
As you can see, all three variables are private. The source variable holds the
PeekableInputStream that is being parsed. The tag variable holds the last HTML
tag found by the parser. The charMap variable holds a mapping between the names of
HTML encoded characters, such as &nbsp;, and the characters they represent.
The following sections examine each of the class's methods.
The Constructor
The ParseHTML class's constructor has two responsibilities. The first is to create a
new PeekableInputStream object based on the InputStream that was passed
to the constructor. The second is to initialize the charMap variable, if it has not
already been initialized.
source = new PeekableInputStream(is);
if (charMap == null)
{
    charMap = new HashMap<String, Character>();
    charMap.put("nbsp", ' ');
    charMap.put("lt", '<');
    charMap.put("gt", '>');
    charMap.put("amp", '&');
    charMap.put("quot", '\"');
}
In HTML encoding there are two ways to store several of the more common characters.
For example, the double quote character can be stored by its ASCII character value as
&#34; or by name as &quot;. The ASCII character codes are easy to parse, as you simply
extract their numeric values and convert them to characters.
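The numeric form can be decoded in a few lines. The following is only a minimal sketch of the idea, not the code used by ParseHTML. The class NumericEntityDemo and the method decodeNumericEntity are hypothetical names introduced for illustration, and the sketch assumes a decimal code that fits in a single Java char.

```java
final class NumericEntityDemo {
    // Hypothetical helper, not part of ParseHTML: decodes a decimal
    // numeric character reference such as "&#34;".
    static char decodeNumericEntity(String entity) {
        if (!entity.startsWith("&#") || !entity.endsWith(";")) {
            throw new IllegalArgumentException("Not a numeric entity: " + entity);
        }
        // Extract the digits between "&#" and ";".
        String digits = entity.substring(2, entity.length() - 1);
        // Convert the numeric value to a character.
        return (char) Integer.parseInt(digits);
    }

    public static void main(String[] args) {
        System.out.println(decodeNumericEntity("&#34;")); // the double quote
        System.out.println(decodeNumericEntity("&#65;")); // the letter A
    }
}
```

A malformed code, such as one with no digits, causes Integer.parseInt to throw a NumberFormatException, so a real parser would need to decide how to recover from bad input.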
As you can see from the above code, each of the special characters is loaded into a
Map, which allows the parseSpecialCharacter method, discussed later, to look
them up quickly.
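To show why a Map suits this job, here is a minimal sketch of a name lookup. It is not the book's implementation of the special character method. The class NamedEntityDemo and the method lookup are hypothetical, and the sketch assumes the same five entries that the constructor loads.

```java
import java.util.HashMap;
import java.util.Map;

final class NamedEntityDemo {
    // Same entries as the constructor's charMap, rebuilt here so the
    // sketch is self-contained.
    private static final Map<String, Character> CHAR_MAP = new HashMap<>();
    static {
        CHAR_MAP.put("nbsp", ' ');
        CHAR_MAP.put("lt", '<');
        CHAR_MAP.put("gt", '>');
        CHAR_MAP.put("amp", '&');
        CHAR_MAP.put("quot", '"');
    }

    // Hypothetical helper: accepts "quot" or "&quot;" and returns the
    // mapped character, or null when the name is not in the map.
    static Character lookup(String entity) {
        String name = entity;
        if (name.startsWith("&")) {
            name = name.substring(1);
        }
        if (name.endsWith(";")) {
            name = name.substring(0, name.length() - 1);
        }
        return CHAR_MAP.get(name);
    }

    public static void main(String[] args) {
        System.out.println(lookup("&lt;"));
        System.out.println(lookup("&unknown;"));
    }
}
```

Returning null for an unknown name leaves it to the caller to decide whether to skip the entity or pass it through unchanged.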
Removing White Space with eatWhiteSpace
HTML documents generally have quite a bit of extra white space. White space is the
extra spaces, carriage returns and tabs placed in an HTML document. It makes the
HTML source code easier for a human to read, but it generally has nothing to do with
the display and is of no use to the parser, so it can be skipped.
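Skipping white space amounts to reading until a character that is not white space appears, then leaving that character in place for the next read. The sketch below illustrates this under stated assumptions and is not the book's eatWhiteSpace method. Because PeekableInputStream is the book's own class, the sketch substitutes the standard java.io.PushbackInputStream, and the class WhiteSpaceDemo and the helper firstByteAfterWhiteSpace are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.PushbackInputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

final class WhiteSpaceDemo {
    // Consumes leading white space and leaves the stream positioned at
    // the first byte that is not white space (or at end of stream).
    static void eatWhiteSpace(PushbackInputStream in) throws IOException {
        int ch;
        while ((ch = in.read()) != -1) {
            if (!Character.isWhitespace(ch)) {
                in.unread(ch); // put the byte back for the next reader
                return;
            }
        }
    }

    // Hypothetical convenience wrapper: returns the first byte after any
    // leading white space, or -1 if the input is all white space.
    static int firstByteAfterWhiteSpace(String html) {
        try {
            PushbackInputStream in = new PushbackInputStream(
                    new ByteArrayInputStream(html.getBytes(StandardCharsets.US_ASCII)));
            eatWhiteSpace(in);
            return in.read();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println((char) firstByteAfterWhiteSpace(" \r\n\t<p>Hello</p>"));
    }
}
```

A stream that supports peeking would avoid the read and unread pair, which is presumably why the parser wraps its input in a PeekableInputStream.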