if (!first)
{
  // Resolve both links relative to the base URL, then process the pair.
  URL urlOfficial = new URL(url, value);
  URL urlFlag = new URL(url, src);
  processItem(urlOfficial, urlFlag);
} else
  first = false;
If a tag was not found, add the text to the buffer.
} else
{
buffer.append((char) ch);
}
Finally, return the next page, if one was found.
return result;
This function is called repeatedly, returning the next page each time, until all 50
states have been processed.
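The calling pattern described above can be sketched as follows. This is a minimal illustration, not the recipe's actual code: the class PagingSketch, the method names, and the map of next-page links are all hypothetical stand-ins, and page downloads are replaced by a simple lookup so that the example runs on its own.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PagingSketch {
    /**
     * Hypothetical stand-in for the page-processing function: records the
     * page as handled and returns the next page, or null if none was found.
     * The nextLinks map simulates the "next" link found on each page.
     */
    static String processPage(String page, Map<String, String> nextLinks,
                              List<String> visited) {
        visited.add(page);
        return nextLinks.get(page); // null when the page has no next link
    }

    /** Keeps processing pages until processPage reports no next page. */
    static List<String> crawl(String start, Map<String, String> nextLinks) {
        List<String> visited = new ArrayList<>();
        String page = start;
        while (page != null) {
            page = processPage(page, nextLinks, visited);
        }
        return visited;
    }

    public static void main(String[] args) {
        Map<String, String> links = new HashMap<>();
        links.put("page1", "page2");
        links.put("page2", "page3");
        System.out.println(crawl("page1", links)); // prints [page1, page2, page3]
    }
}
```

Returning null for "no next page" keeps the driving loop to a single while statement, which matches the way the function above returns its result at the end.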
Summary
This chapter showed you how to extract data from HTML. Most of the data that a bot
would like to access will be in HTML form. Previous chapters showed how to extract data
from simple HTML constructs; this chapter expanded on that considerably.
The chapter began by showing you how to create an HTML parser. This HTML parser
is fairly short, but it can handle any HTML file, even one that is not properly formatted. The
HTML parser built into Java can run into issues with improperly formatted HTML.
Unfortunately, there is a fair amount of improperly formatted HTML on the web.
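To illustrate why a character-by-character approach tolerates bad markup, here is a minimal sketch. It is not the parser built in this chapter: the class LenientScanSketch and its tagNames method are hypothetical, and the sketch deliberately ignores comments, scripts, and attribute values that contain a '>' character. It only shows that a scanner which never requires matching or closed tags has nothing to fail on.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class LenientScanSketch {
    /** Returns the lower-cased names of all opening tags, tolerating bad markup. */
    static List<String> tagNames(Reader in) throws IOException {
        List<String> names = new ArrayList<>();
        int ch;
        while ((ch = in.read()) != -1) {
            if (ch != '<') {
                continue; // plain text; a full parser would buffer it here
            }
            // Collect everything up to '>' or end of input, whichever comes first.
            StringBuilder tag = new StringBuilder();
            while ((ch = in.read()) != -1 && ch != '>') {
                tag.append((char) ch);
            }
            String body = tag.toString().trim();
            if (body.startsWith("/")) {
                continue; // closing tag: no nesting check, so mismatches are harmless
            }
            // The tag name runs up to the first whitespace character.
            int end = 0;
            while (end < body.length() && !Character.isWhitespace(body.charAt(end))) {
                end++;
            }
            String name = body.substring(0, end);
            if (name.endsWith("/")) {
                name = name.substring(0, name.length() - 1); // e.g. <br/>
            }
            if (!name.isEmpty()) {
                names.add(name.toLowerCase(Locale.ROOT));
            }
        }
        return names;
    }

    public static void main(String[] args) throws IOException {
        // Mixed case, an unclosed <br>, and a tag cut off at end of input.
        String html = "<P>one<br>two</p><b unclosed";
        System.out.println(tagNames(new StringReader(html))); // prints [p, br, b]
    }
}
```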
HTML pages can come in a variety of formats. This chapter included seven recipes to
show you how to extract data from many of these formats. You were shown how to extract
hyperlinks, images, and forms, and how to extract data that spans multiple pages.
So far, the recipes have mostly just downloaded data from a web server, with little
interaction beyond that. In the next chapter, you will see how a bot can send form data
to a web server. This allows the bot to interact with the web server just as a human
does when using a form.