EXTRACTING DATA - HTTP Programming Recipes for Java Bots

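// The first match is skipped; every later pair of links is processed.
if (!first)
{
  // Resolve both (possibly relative) links against the page URL.
  URL urlOfficial = new URL(url, value);
  URL urlFlag = new URL(url, src);
  processItem(urlOfficial, urlFlag);
} else
  first = false;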

If a tag was not found, add the text to the buffer.

} else
{
  buffer.append((char) ch);
}

Finally, return the next page, if one was found.

return result;

This function keeps returning the next page until it has worked through all 50 states.
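
The overall pattern (process one page, return the link to the next page, repeat until none is left) can be sketched roughly as follows. This is not the recipe's actual code. The class name, the in-memory PAGES map standing in for real downloads, and the example.com URLs are all assumptions made for illustration, and java.net.URI.resolve is used here in place of the two-argument URL constructor shown above.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of a multi-page extraction loop (not the book's listing).
class PagedExtractSketch {
    // Stand-in for the web site: page URL -> { item found on the page,
    // relative link to the next page (null on the last page) }.
    static final Map<String, String[]> PAGES = new LinkedHashMap<>();
    static {
        PAGES.put("http://www.example.com/states/page1.html",
                new String[] {"Alabama", "page2.html"});
        PAGES.put("http://www.example.com/states/page2.html",
                new String[] {"Alaska", "page3.html"});
        PAGES.put("http://www.example.com/states/page3.html",
                new String[] {"Arizona", null});
    }

    // Process one page: record its item, then return the absolute URL of
    // the next page, or null if this page has no "next" link.
    static URI processPage(URI page, List<String> items) {
        String[] entry = PAGES.get(page.toString());
        items.add(entry[0]);
        String next = entry[1];
        // Resolve the relative link against the current page's URL.
        return next == null ? null : page.resolve(next);
    }

    // Keep following "next" links until a page returns none.
    static List<String> crawl(String start) {
        List<String> items = new ArrayList<>();
        URI page = URI.create(start);
        while (page != null) {
            page = processPage(page, items);
        }
        return items;
    }

    public static void main(String[] args) {
        // prints [Alabama, Alaska, Arizona]
        System.out.println(crawl("http://www.example.com/states/page1.html"));
    }
}
```

Returning null from the page function when no further link exists gives the caller a simple stopping condition, which is the role the returned result plays in the listing above.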

Summary

This chapter showed you how to extract data from HTML. Most of the data that a bot would like to access will be in HTML form. Previous chapters showed how to extract data from simple HTML constructs; this chapter expanded on that considerably.

The chapter began by showing you how to create an HTML parser. This parser is fairly short, but it can handle any HTML file, even one that is not properly formatted. The HTML parser built into Java can run into problems with improperly formatted HTML, and unfortunately there is a fair amount of improperly formatted HTML on the web.
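
To illustrate why a character-by-character approach tolerates badly formed pages, here is a minimal sketch. It is not the parser developed in this chapter, and the class and method names are hypothetical. It only splits input into tags and text, buffering any character that is not part of a tag, and it never checks that tags are closed or properly nested.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of a tolerant tag/text scanner (not the book's parser).
class TolerantScanSketch {
    // Split HTML into tokens: "<...>" strings for tags, plain strings for text.
    static List<String> scan(String html) {
        List<String> tokens = new ArrayList<>();
        StringBuilder buffer = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char ch = html.charAt(i);
            if (ch == '<') {
                int close = html.indexOf('>', i);
                if (close < 0) {
                    // Unterminated tag: keep the remainder as ordinary text.
                    buffer.append(html, i, html.length());
                    break;
                }
                flush(buffer, tokens);
                tokens.add(html.substring(i, close + 1));
                i = close + 1;
            } else {
                // Not a tag, so add the character to the text buffer.
                buffer.append(ch);
                i++;
            }
        }
        flush(buffer, tokens);
        return tokens;
    }

    // Emit any buffered text as one token and reset the buffer.
    static void flush(StringBuilder buffer, List<String> tokens) {
        if (buffer.length() > 0) {
            tokens.add(buffer.toString());
            buffer.setLength(0);
        }
    }

    public static void main(String[] args) {
        // Neither <p> nor <b> is ever closed, yet scanning still succeeds.
        System.out.println(scan("<p>Hello <b>world<p>done"));
    }
}
```

Because the scanner never requires a closing tag, missing or misnested tags simply show up as they appear in the input rather than causing a failure.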

HTML pages can come in a variety of formats. This chapter included seven recipes to show you how to extract data from many of these formats. You were shown how to extract hyperlinks, images, and forms, and how to extract data that spans multiple pages.

So far the recipes in this book have mostly just downloaded data from a web server. There has not been much interactivity with the web server. In the next chapter you will see how a bot can send form data to a web server. This allows the bot to interact with the web server just like a human using a form.
