ParseHTML parse = new ParseHTML(is);
The method loops over every tag and text character in the HTML file.
int ch;
while ((ch = parse.read()) != -1)
{
  if (ch == 0)
  {
    HTMLTag tag = parse.getTag();
    if (tag.getName().equalsIgnoreCase("a"))
    {
When an <a> tag is located, its href attribute is examined.
value = tag.getAttributeValue("href");
A new URL object is created from the parent URL and the href value. This provides
the fully qualified URL for the sub-page.
URL u = new URL(url, value.toString());
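To illustrate how a relative href becomes a fully qualified address, here is a minimal, self-contained sketch. It uses java.net.URI.resolve, which performs the same kind of relative-reference resolution as the two-argument URL constructor shown above. The class name ResolveHrefSketch, the resolve helper, and the example.com addresses are assumptions made for this illustration and are not part of the recipe.

```java
import java.net.URI;

class ResolveHrefSketch {
    // Resolve an href value against the address of the page it was found on.
    // Hypothetical helper for illustration; the recipe itself uses new URL(url, value).
    static String resolve(String parentUrl, String href) {
        return URI.create(parentUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // Assumed parent page address, used only for this example.
        String parent = "http://www.example.com/states/index.html";

        // A relative href replaces the last path segment of the parent.
        System.out.println(resolve(parent, "alabama.html"));
        // prints http://www.example.com/states/alabama.html

        // A root-relative href keeps only the scheme and host of the parent.
        System.out.println(resolve(parent, "/about.html"));
        // prints http://www.example.com/about.html

        // An absolute href is returned unchanged.
        System.out.println(resolve(parent, "http://other.example.org/x.html"));
        // prints http://other.example.org/x.html
    }
}
```

Note that URI.create rejects strings that are not valid URIs (for example, an href containing an unescaped space), so real-world input may need cleaning first.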
The processSubPage method is then called for each sub-page.
value = u.toString();
processSubPage(u);
In this way, the loop visits every link on the parent page and calls processSubPage for each one.
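The overall shape of this loop can be sketched without the book's ParseHTML class. The version below swaps in a regular expression as a simple stand-in for the tag parser, so it only recognizes double-quoted href values and is far less robust than a real HTML parser. The class name SubPageLinksSketch, the findSubPages method, and the sample markup are all assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SubPageLinksSketch {
    // Matches <a ... href="..."> in any letter case; group 1 is the href value.
    // A regex is used here only as a stand-in for a real tag parser.
    private static final Pattern HREF = Pattern.compile(
            "<a\\b[^>]*?\\bhref\\s*=\\s*\"([^\"]*)\"",
            Pattern.CASE_INSENSITIVE);

    // Collect the fully qualified address of every link found in the HTML.
    static List<URI> findSubPages(URI parent, String html) {
        List<URI> result = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            // Resolve each href against the parent page address.
            result.add(parent.resolve(m.group(1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed parent address and markup, used only for this example.
        URI parent = URI.create("http://www.example.com/states/index.html");
        String html = "<p><a href=\"alabama.html\">Alabama</a> "
                + "<A HREF=\"/about.html\">About</A></p>";

        for (URI u : findSubPages(parent, html)) {
            // This is the point where the recipe would call processSubPage(u).
            System.out.println(u);
        }
        // prints http://www.example.com/states/alabama.html
        // prints http://www.example.com/about.html
    }
}
```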
Extracting from the Sub-Pages
Extracting data from the sub-pages is not very different from the other data extraction
examples. The processSubPage method begins by downloading the HTML page.
Next, the method attempts to locate the postal code.
String str = downloadPage(u, 5000);
String code = extractNoCase(str, "Code:<b></td><td>", "</td>", 0);
If no postal code is located, then we know that there is no US state information on this
page. The parent page contains several extra links that do not point to state sub-pages,
and this check allows us to quickly discard such pages.
The state's postal code is located by searching for the key text Code:<b></td><td>,
which occurs just before the postal code in the HTML file. You will also notice that we use a
new function, named extractNoCase. The extractNoCase function is very similar
to the extract method introduced in Chapter 3. However, extractNoCase does
not require that the beginning and ending text strings match the case exactly on the HTML
page.
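As a rough idea of how a case-insensitive extraction like this could work, here is a minimal sketch. It is not the book's implementation: the class name ExtractNoCaseSketch and the indexOfIgnoreCase helper are hypothetical, and the meaning given to the fourth argument (the number of earlier matches of the first token to skip, with 0 meaning use the first match) is an assumption. The sketch compares text with String.regionMatches rather than lower-casing the whole page, so character positions in the original string stay valid.

```java
class ExtractNoCaseSketch {
    // Case-insensitive version of indexOf, built on String.regionMatches.
    static int indexOfIgnoreCase(String text, String token, int from) {
        for (int i = Math.max(from, 0); i + token.length() <= text.length(); i++) {
            if (text.regionMatches(true, i, token, 0, token.length())) {
                return i;
            }
        }
        return -1;
    }

    // Hypothetical stand-in for extractNoCase.
    // ASSUMPTION: skip is the number of earlier token1 matches to pass over.
    // Returns the text between token1 and token2, or null if either is missing.
    static String extractNoCase(String text, String token1, String token2, int skip) {
        int start = -1;
        int from = 0;
        for (int i = 0; i <= skip; i++) {
            start = indexOfIgnoreCase(text, token1, from);
            if (start == -1) {
                return null;
            }
            from = start + 1;
        }
        int contentStart = start + token1.length();
        int end = indexOfIgnoreCase(text, token2, contentStart);
        if (end == -1) {
            return null;
        }
        return text.substring(contentStart, end);
    }

    public static void main(String[] args) {
        // Made-up markup whose letter case differs from the search tokens.
        String html = "<tr><td><B>Code:<B></TD><TD>MO</TD></tr>";

        System.out.println(extractNoCase(html, "Code:<b></td><td>", "</td>", 0));
        // prints MO

        // A page without the key text yields null, which marks it as a non-state page.
        System.out.println(extractNoCase(html, "Capital:", "</td>", 0));
        // prints null
    }
}
```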
if (code != null)
{