EXTRACTING DATA - HTTP Programming Recipes for Java Bots


</tr>

<tr>

<td>Alabama</td>

<td>Montgomery</td>

<td>

<a href="http://www.alabama.gov/">http://www.alabama.gov/

</a></td>

</tr>

<tr>

<td>Alaska</td>

<td>Juneau</td>

<td>

<a href="http://www.state.ak.us/">http://www.state.ak.us/

</a></td>

</tr>

...

<tr>

<td>Wyoming</td>

<td>Cheyenne</td>

<td><a href="http://wyoming.gov/">http://wyoming.gov/</a></td>

</tr>

</table>

The data that we will parse is located between the <td> and </td> tags. The surrounding
<tr> and </tr> tags mark where each row begins and ends, so they tell us which row a cell
belongs to.

Parsing the Table

The table is parsed by the process method of the ParseTable class. This method begins by
opening an InputStream to the URL that contains the table, and a ParseHTML object is
created to parse this InputStream. Three variables hold the parsing state: buffer
accumulates the text of the current table cell, list holds the cells collected so far
for the current row, and capture records whether HTML text is currently being copied
into buffer. Capturing occurs only while the parser is between a <td> tag and its
matching </td> tag.

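// Open a stream to the page that contains the table.
InputStream is = url.openStream();
ParseHTML parse = new ParseHTML(is);
// Text of the table cell currently being read.
StringBuilder buffer = new StringBuilder();
// Cells collected so far for the current row.
List<String> list = new ArrayList<String>();
// True only while the parser is between <td> and </td>.
boolean capture = false;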
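The interplay of buffer, list, and capture can be sketched without the book's classes. The example below is an illustrative stand-in, not the ParseTable code: the class name TableCaptureSketch and the method parseRows are hypothetical, and a small hand-written tag scanner over a String takes the place of ParseHTML, whose API is not shown in this section.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative stand-in for the capture pattern; it does NOT use the
// book's ParseHTML class. All names in this class are hypothetical.
class TableCaptureSketch {

    // Returns one list per table row, holding the text of each <td> cell.
    static List<List<String>> parseRows(String html) {
        List<List<String>> rows = new ArrayList<>();
        StringBuilder buffer = new StringBuilder(); // text of the current cell
        List<String> list = new ArrayList<>();      // cells of the current row
        boolean capture = false;                    // true between <td> and </td>

        int i = 0;
        while (i < html.length()) {
            char ch = html.charAt(i);
            int close = (ch == '<') ? html.indexOf('>', i) : -1;
            if (close < 0) {
                // Ordinary character (or an unterminated '<'):
                // keep it only while capturing.
                if (capture) {
                    buffer.append(ch);
                }
                i++;
                continue;
            }
            // Tag name without attributes, lower-cased for comparison.
            String tag = html.substring(i + 1, close).trim()
                    .split("\\s+")[0].toLowerCase();
            switch (tag) {
                case "td":   // a cell starts: clear the buffer and capture
                    buffer.setLength(0);
                    capture = true;
                    break;
                case "/td":  // a cell ends: store its text in the row
                    list.add(buffer.toString().trim());
                    capture = false;
                    break;
                case "/tr":  // a row ends: store the row and start a new one
                    if (!list.isEmpty()) {
                        rows.add(list);
                        list = new ArrayList<>();
                    }
                    break;
                default:     // other tags such as <a> are skipped,
                    break;   // but the text inside them is still captured
            }
            i = close + 1;
        }
        return rows;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>Alabama</td><td>Montgomery</td>"
            + "<td><a href=\"http://www.alabama.gov/\">http://www.alabama.gov/</a></td></tr>"
            + "<tr><td>Alaska</td><td>Juneau</td>"
            + "<td><a href=\"http://www.state.ak.us/\">http://www.state.ak.us/</a></td></tr>"
            + "</table>";
        for (List<String> row : parseRows(html)) {
            System.out.println(row);
        }
    }
}
```

Because the <a> tag is ignored rather than treated as a cell boundary, the link text inside the third column ends up in the buffer like any other cell text.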

The advance method moves the parser forward to the correct table in the HTML page. This
method is discussed in Recipe 6.1.
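As a rough illustration of what "advancing to the correct table" can mean, the sketch below skips to the position just past the n-th opening tag of a given name. It is an assumption about the general idea only: the class AdvanceSketch, its advance method, and its String-based signature are hypothetical and are not the method from Recipe 6.1, which works on a ParseHTML object.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; not the advance method from Recipe 6.1.
class AdvanceSketch {

    // Returns the index just past the count-th opening <tag ...> in html
    // (count is 1-based), or -1 if there are fewer than count such tags.
    static int advance(String html, String tag, int count) {
        // Match "<tag>" or "<tag attributes>", ignoring case, so that
        // "<table" does not also match a longer name such as "<tablet>".
        Pattern open = Pattern.compile(
                "<" + Pattern.quote(tag) + "(\\s[^>]*)?>",
                Pattern.CASE_INSENSITIVE);
        Matcher m = open.matcher(html);
        while (m.find()) {
            if (--count == 0) {
                return m.end();
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String html = "<table><tr><td>menu</td></tr></table>"
            + "<TABLE border=\"1\"><tr><td>Alabama</td></tr></table>";
        int pos = advance(html, "table", 2);
        // Everything from pos onward belongs to the second table.
        System.out.println(html.substring(pos));
    }
}
```

A page often contains layout tables ahead of the one that holds the data, which is why a count parameter of this kind is useful: the caller states which table it wants rather than taking the first one found.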
