</tr>
<tr>
<td>Alabama</td>
<td>AL</td>
<td>Montgomery</td>
<td>
<a href="http://www.alabama.gov/">http://www.alabama.gov/
</a></td>
</tr>
<tr>
<td>Alaska</td>
<td>AK</td>
<td>Juneau</td>
<td>
<a href="http://www.state.ak.us/">http://www.state.ak.us/
</a></td>
</tr>
...
<tr>
<td>Wyoming</td>
<td>WY</td>
<td>Cheyenne</td>
<td><a href="http://wyoming.gov/">http://wyoming.gov/</a></td>
</tr>
</table>
The data that we will parse is located between the <td> and </td> tags. However, the
other tags tell us which row the data belongs to.
Parsing the Table
The table is parsed by the process method of the ParseTable class. This method
begins by opening an InputStream to the URL that contains the table. A ParseHTML
object is created to parse this InputStream. A variable named buffer is created to
hold the data for each table cell. A variable named list is created to hold each column
of data for a row. A variable named capture tracks whether we are currently capturing
HTML text into the buffer variable. Capturing occurs only while we are between
<td> and </td> tags.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
StringBuilder buffer = new StringBuilder();
List<String> list = new ArrayList<String>();
boolean capture = false;
The advance method will take us to the correct table in the HTML page. The advance
method is discussed in Recipe 6.1.
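The roles of buffer, list, and capture can be illustrated with a minimal, self-contained sketch. It does not use the ParseHTML class or the process method from this recipe. Instead it assumes the HTML is already in a String and scans it with a small hand-written loop. The class name TableCaptureSketch and the methods parseRows and tagName are hypothetical names chosen for this illustration. The sketch also assumes a single, non-nested table and does not decode HTML entities.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical illustration only; this is not the recipe's ParseTable class.
final class TableCaptureSketch {

    /** Collects the text of every <td> cell, grouped by <tr> row. */
    static List<List<String>> parseRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        List<String> list = new ArrayList<>();       // cells of the current row
        StringBuilder buffer = new StringBuilder();  // text of the current cell
        boolean capture = false;                     // true only between <td> and </td>

        int i = 0;
        while (i < html.length()) {
            char ch = html.charAt(i);
            if (ch == '<') {
                int end = html.indexOf('>', i);
                if (end < 0) {
                    break; // unterminated tag: stop rather than guess
                }
                String tag = tagName(html.substring(i + 1, end));
                if (tag.equals("td")) {
                    buffer.setLength(0);             // start a fresh cell
                    capture = true;
                } else if (tag.equals("/td")) {
                    list.add(buffer.toString().trim());
                    capture = false;
                } else if (tag.equals("/tr") && !list.isEmpty()) {
                    rows.add(list);                  // the row is complete
                    list = new ArrayList<>();
                }
                // Any other tag (for example <a>) is skipped; its text is still captured.
                i = end + 1;
            } else {
                if (capture) {
                    buffer.append(ch);
                }
                i++;
            }
        }
        return rows;
    }

    /** Returns the lowercase tag name, without attributes. */
    private static String tagName(String inside) {
        String s = inside.trim().toLowerCase(Locale.ROOT);
        int cut = 0;
        while (cut < s.length() && !Character.isWhitespace(s.charAt(cut))) {
            cut++;
        }
        return s.substring(0, cut);
    }

    public static void main(String[] args) {
        String html = "<table>\n<tr>\n<td>Alabama</td>\n<td>AL</td>\n<td>Montgomery</td>\n"
                + "<td>\n<a href=\"http://www.alabama.gov/\">http://www.alabama.gov/\n</a></td>\n"
                + "</tr>\n</table>";
        for (List<String> row : parseRows(html)) {
            System.out.println(row);
        }
    }
}
```

Because the buffer is reset at each <td> and flushed to the list at each </td>, text outside a cell, such as the line breaks between rows, is never captured. The </tr> tag is what marks the end of a row, which is why the tags other than <td> still matter.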