EXTRACTING DATA - HTTP Programming Recipes for Java Bots

CHAPTER 6: EXTRACTING DATA

• Parsing HTML

• Extracting from forms

• Extracting lists, images and hyperlinks

• Extracting data from multiple pages

The previous chapters showed how to extract simple data items from web pages. This
chapter expands upon that and focuses on how to extract more complex data structures
from HTML messages. Of course, HTML is not the only format to extract from. Later
chapters will discuss non-HTML formats, such as XML.

This chapter will present several recipes for extracting data from a variety of
HTML structures:

• Extracting data spread across many HTML pages

• Extracting images

• Extracting hyperlinks

• Extracting data from HTML forms

• Extracting data from HTML lists

• Extracting data from HTML tables

Extracting data from these types of HTML structures is more complex than the simple
data extraction performed in previous chapters. To extract this data we will need an
HTML parser. There are three options for obtaining an HTML parser:

• Use the HTML parser built into Java Swing

• Use a third-party HTML parser

• Write your own HTML parser

Java includes a full-featured HTML parser, which is built into Swing. I've used this parser
for a number of projects. However, it has some limitations. The Swing HTML parser has
some issues with heavy multithreading. This can be a problem for spiders and bots that
must access a large number of HTML pages and rely on heavy multithreading to do so.
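To make the first option concrete, the following is a minimal sketch of how the Swing parser can be driven through ParserDelegator and an HTMLEditorKit.ParserCallback, both standard JDK classes. The class name SwingLinkExtractor, the method extractLinks and the sample HTML are assumptions made up for this illustration; they are not recipes from this chapter.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: collects href values with the Swing HTML parser.
public class SwingLinkExtractor {

    /** Returns the href value of every anchor start tag found in the HTML. */
    public static List<String> extractLinks(String html) throws IOException {
        final List<String> links = new ArrayList<>();

        // The parser reports what it finds by calling back into this object.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Known tags are compared against symbolic constants.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(reader, callback, true);
        }
        return links;
    }

    public static void main(String[] args) throws IOException {
        // Made-up sample page for the demonstration.
        String html = "<html><body><a href=\"page1.html\">One</a> "
                + "<a href=\"page2.html\">Two</a></body></html>";
        System.out.println(extractLinks(html));
    }
}
```

A fresh ParserDelegator and callback are created on every call here. That keeps the sketch simple, but it should not be read as a fix for the multithreading issues mentioned above.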

Additionally, the Swing HTML parser expects HTML to be properly formatted and well
defined. All HTML tags are defined as symbolic constants, which makes tags unknown to the
Swing parser more difficult to process. In an ideal world all web sites would have beautifully
formatted and syntactically correct HTML, and in that world the Swing parser would be
great. However, I've worked with several cases where a poorly formatted site caused great
confusion for the Swing parser.
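The point about symbolic constants can be illustrated with a short sketch. Known tags arrive as constants such as HTML.Tag.A, while a tag the parser does not recognize arrives as an instance of HTML.UnknownTag and has to be identified by its name. The class name UnknownTagFinder and the made-up widget tag are assumptions for this illustration only. In the JDK implementations I am aware of, unknown tags are reported through handleSimpleTag, but the sketch checks handleStartTag as well rather than relying on that.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashSet;
import java.util.Set;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical example class: lists tag names the Swing parser does not know.
public class UnknownTagFinder {

    /** Returns the distinct names of tags that have no HTML.Tag constant. */
    public static Set<String> findUnknownTags(String html) throws IOException {
        final Set<String> unknown = new LinkedHashSet<>();

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                record(tag);
            }

            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                record(tag);
            }

            // Unknown tags cannot be compared to a constant, only inspected by name.
            private void record(HTML.Tag tag) {
                if (tag instanceof HTML.UnknownTag) {
                    unknown.add(tag.toString());
                }
            }
        };

        new ParserDelegator().parse(new StringReader(html), callback, true);
        return unknown;
    }

    public static void main(String[] args) throws IOException {
        // "widget" is an invented tag used only for this demonstration.
        String html = "<html><body><p>Price: <widget id=\"w1\">42</widget></p></body></html>";
        System.out.println(findUnknownTags(html));
    }
}
```

Because such tags have no constant to compare against, code that depends on them ends up matching on strings, which is part of what makes nonstandard HTML harder to handle with this parser.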
