CHAPTER 6: EXTRACTING DATA
• Parsing HTML
• Extracting from forms
• Extracting lists, images and hyperlinks
• Extracting data from multiple pages
The previous chapters showed how to extract simple data items from web pages. This
chapter expands upon that by focusing on how to extract more complex data structures
from HTML pages. Of course, HTML is not the only format to extract from. Later chapters
will discuss non-HTML formats, such as XML.
This chapter will present several recipes for extracting data from a variety of different
HTML structures:
• Extracting data spread across many HTML pages
• Extracting images
• Extracting hyperlinks
• Extracting data from HTML forms
• Extracting data from HTML lists
• Extracting data from HTML tables
Extracting data from these types of HTML structures is more complex than the simple
data extraction shown in previous chapters. To extract this data, we will need an HTML
parser. There are three options for obtaining one:
• Use the HTML parser built into Java Swing
• Use a third-party HTML parser
• Write your own HTML parser
Java includes a full-featured HTML parser, which is built into Swing. I've used this parser
for a number of projects. However, it has some limitations. The Swing HTML parser has
some issues with heavy multithreading. This can be a problem for spiders and bots
that must access a large number of HTML pages and rely on heavy multithreading.
Additionally, the Swing HTML parser expects HTML to be properly formatted and well
defined. All HTML tags are defined as symbolic constants, which makes tags unknown to the
Swing parser more difficult to process. In an ideal world, all web sites would have beautifully
formatted and syntactically correct HTML, and in that world the Swing parser would be
great. However, I've worked with several cases where a poorly formatted site caused great
confusion for the Swing parser.
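To make the first option concrete, the following is a minimal sketch of driving the Swing parser to collect hyperlinks. It is not one of this chapter's recipes. The class name SwingLinkExtractor, the method extractLinks, and the sample HTML are hypothetical names chosen for illustration. The sketch relies only on the standard ParserDelegator and HTMLEditorKit.ParserCallback classes, and it shows the symbolic constants mentioned above (HTML.Tag.A, HTML.Attribute.HREF) in use.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, for illustration only.
final class SwingLinkExtractor {

    // Returns the href value of every <a> tag found in the given HTML text.
    static List<String> extractLinks(String html) {
        final List<String> links = new ArrayList<>();

        // The parser reports each tag to this callback as it is encountered.
        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleStartTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                // Tags are compared against Swing's symbolic constants.
                if (tag == HTML.Tag.A) {
                    Object href = attrs.getAttribute(HTML.Attribute.HREF);
                    if (href != null) {
                        links.add(href.toString());
                    }
                }
            }
        };

        try {
            // The third argument tells the parser to ignore charset declarations.
            new ParserDelegator().parse(new StringReader(html), callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return links;
    }

    public static void main(String[] args) {
        // Sample input assumed for this sketch.
        String html = "<html><body><a href=\"http://www.example.com/a\">A</a> "
                + "<a href=\"/b\">B</a></body></html>";
        for (String link : extractLinks(html)) {
            System.out.println(link);
        }
    }
}
```

Because the callback only ever sees the tags that Swing defines as constants, this approach works best on pages that use standard, well-formed HTML.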
 