<tr><td>2</td>
<td>Monaco</td>
<td>4.4</td>
<td>2</td>
<td>2.2000000</td>
</tr>
<tr><td>3</td>
<td>San Marino</td>
<td>39</td>
<td>61</td>
<td>0.6393443</td>
</tr>
The value passed to the -e option, which stands for expression, is a so-called CSS
selector. This syntax is normally used to style web pages, but we can also use it to select
certain elements from our HTML. In this case, we wish to select all <tr> elements, or
rows, except the first, that are part of a table belonging to the wikitable class.
This is precisely the table that we're interested in. We exclude the
first row (specified by :not(:first-child)) because it is the header of the
table. The result is a data set where each row represents a country or territory. As
you can see, we now have the <tr> elements that we're looking for, encapsulated in
<html> and <body> elements (because we specified the -b option). This ensures that
our next tool, xml2json, can work with it.
As its name implies, xml2json converts XML (and HTML) to JSON.
$ < table.html xml2json > table.json
$ < table.json jq '.' | head -n 25
{
  "html": {
    "body": {
      "tr": [
        {
          "td": [
            {
              "$t": "1"
            },
            {
              "$t": "Vatican City"
            },
            {
              "$t": "3.2"
            },
            {
              "$t": "0.44"
            },
            {
              "$t": "7.2727273"
            }
          ]
        },
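To see how this nested structure can be walked, here is a minimal sketch, assuming jq is installed. The inline sample is hand-written in the same shape as the xml2json output and reuses the Vatican City and Monaco rows shown earlier; the variable names sample and rows are illustrative only, and this is one possible way to flatten the data rather than a prescribed next step.

```shell
# Hand-written sample in the same shape as the xml2json output above
# (an assumption for illustration; it is not read from table.json).
sample='{"html":{"body":{"tr":[
  {"td":[{"$t":"1"},{"$t":"Vatican City"},{"$t":"3.2"},{"$t":"0.44"},{"$t":"7.2727273"}]},
  {"td":[{"$t":"2"},{"$t":"Monaco"},{"$t":"4.4"},{"$t":"2"},{"$t":"2.2000000"}]}
]}}}'

# For every row, collect the "$t" text of each cell into an array,
# then emit that array as one CSV line.
rows=$(printf '%s\n' "$sample" | jq -r '.html.body.tr[] | [.td[] | .["$t"]] | @csv')
printf '%s\n' "$rows"
```

Each <tr> becomes one line of quoted, comma-separated values. The key "$t" is written as .["$t"] inside single quotes so that neither the shell nor jq treats the dollar sign as a variable reference.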