ParseHTML parse = new ParseHTML(is);
The method loops across every tag and text character in the HTML file.
int ch;
while ((ch = parse.read()) != -1)
{
if (ch == 0)
{
HTMLTag tag = parse.getTag();
if (tag.getName().equalsIgnoreCase("a"))
{
When an <a> tag is located, its href attribute is examined.
value = tag.getAttributeValue("href");
A new URL object is created from the parent URL and the href value. This provides the fully qualified URL for the sub-page.
URL u = new URL(url, value.toString());
The processSubPage method is then called for each sub-page.
value = u.toString();
processSubPage(u);
This method will loop through all sub-pages and call processSubPage for each.
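The link-following loop above can be approximated with only the standard library. The sketch below is not the ParseHTML-based code from this chapter: it swaps in a regular expression to find anchor tags, and the class name SubPageLinks, the method findSubPages, and the example.com addresses are hypothetical names chosen for illustration. It does use the same two-argument URL constructor to turn each href into a fully qualified address.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SubPageLinks {
    // Matches href="..." or href='...' inside an <a> tag, ignoring case.
    // A regular expression is a simplification; a real HTML parser is more robust.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*([\"'])(.*?)\\1",
            Pattern.CASE_INSENSITIVE);

    // Returns the fully qualified address of every link found in the HTML.
    // Links that cannot be resolved against the parent are skipped.
    public static List<String> findSubPages(String parentUrl, String html) {
        List<String> result = new ArrayList<>();
        try {
            URL parent = new URL(parentUrl);
            Matcher m = HREF.matcher(html);
            while (m.find()) {
                try {
                    // Same idea as: new URL(url, value.toString())
                    result.add(new URL(parent, m.group(2)).toString());
                } catch (MalformedURLException e) {
                    // Unusable href (for example an unknown protocol): skip it.
                }
            }
        } catch (MalformedURLException e) {
            // The parent address itself is invalid, so nothing can be resolved.
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<A HREF=\"alabama.html\">Alabama</A> "
                + "<a href='/about.html'>About</a>";
        for (String link : findSubPages(
                "http://www.example.com/states/index.html", html)) {
            System.out.println(link);
        }
    }
}
```

A relative href such as alabama.html is resolved against the directory of the parent page, while an href beginning with a slash is resolved against the site root. Each resulting address could then be handed to a method like processSubPage.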
Extracting from the Sub-Pages
Extracting data from the sub-pages is not very different from any of the other data extraction examples. The extractSubPage method begins by downloading the HTML page. Next, the method attempts to locate the postal code.
String str = downloadPage(u, 5000);
String code = extractNoCase(str, "Code:<b></td><td>", "</td>", 0);
If no postal code is located, then we know that there is no US state information on this page. The parent page contains several extra links that do not point to state sub-pages, and this check allows us to quickly discard such pages.
The state's postal code is located by searching for the key text Code:<b></td><td>, which occurs just before the postal code in the HTML file. You will also notice that we use a new function, named extractNoCase. The extractNoCase function is very similar to the extract method introduced in Chapter 3. However, extractNoCase does not require that the beginning and ending text strings match the case exactly on the HTML page.
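To make the idea of case-insensitive extraction concrete, here is a minimal sketch of how a function like extractNoCase might be written. It is not the implementation used by this recipe. In particular, the meaning of the final argument is an assumption: the sketch treats it as the number of earlier occurrences of the starting text to skip, so that 0 selects the first occurrence. The class name ExtractNoCaseSketch and the helper indexOfIgnoreCase are hypothetical.

```java
public class ExtractNoCaseSketch {
    // Finds token in text at or after position from, ignoring case.
    // Returns -1 if the token does not occur.
    private static int indexOfIgnoreCase(String text, String token, int from) {
        for (int i = Math.max(from, 0); i + token.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, token, 0, token.length())) {
                return i;
            }
        }
        return -1;
    }

    // Returns the text between token1 and token2, or null if either is missing.
    // ASSUMPTION: skip is the number of occurrences of token1 to pass over first.
    public static String extractNoCase(String str, String token1,
                                       String token2, int skip) {
        int from = 0;
        for (int i = 0; i <= skip; i++) {
            int begin = indexOfIgnoreCase(str, token1, from);
            if (begin == -1) {
                return null; // starting text not found
            }
            from = begin + token1.length();
        }
        int stop = indexOfIgnoreCase(str, token2, from);
        if (stop == -1) {
            return null; // ending text not found
        }
        return str.substring(from, stop);
    }

    public static void main(String[] args) {
        String html = "<tr><td><b>code:<B></TD><TD>MO</td></tr>";
        String code = extractNoCase(html, "Code:<b></td><td>", "</td>", 0);
        if (code != null) {
            System.out.println("Postal code: " + code);
        } else {
            System.out.println("No postal code on this page.");
        }
    }
}
```

The sketch compares characters with regionMatches rather than lower-casing the whole page, so the positions it finds always refer to the original string. A null result corresponds to the case described above, where the page carries no state information and can be discarded.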
if (code != null)
{