Database Reference
In-Depth Information
{
"td": [
The reason we convert the HTML to JSON is because there is a very powerful tool
called
jq
that operates on JSON data. The following command extracts certain parts
of the JSON data and reshapes it into a form that we can work with:
$
< data/table.json jq -c
'.html.body.tr[] | {country: .td[1][],border:'
\
>
'.td[2][], surface: .td[3][]}'
> countries.json
$
head -n 10 data/countries.json
{"surface":"0.44","border":"3.2","country":"Vatican City"}
{"surface":"2","border":"4.4","country":"Monaco"}
{"surface":"61","border":"39","country":"San Marino"}
{"surface":"160","border":"76","country":"Liechtenstein"}
{"surface":"34","border":"10.2","country":"Sint Maarten (Netherlands)"}
{"surface":"468","border":"120.3","country":"Andorra"}
{"surface":"6","border":"1.2","country":"Gibraltar (United Kingdom)"}
{"surface":"54","border":"10.2","country":"Saint Martin (France)"}
{"surface":"2586","border":"359","country":"Luxembourg"}
{"surface":"6220","border":"466","country":"Palestinian territories"}
Now we're getting somewhere. JSON is a very popular data format with many advanā
tages, but for our purposes we're better off having the data in CSV format. The tool
json2csv
is able to convert the data from JSON to CSV:
$
< countries.json json2csv -p -k border,surface > countries.csv
$
head -n 11 countries.csv | csvlook
|---------+----------|
| border | surface |
|---------+----------|
| 3.2 | 0.44 |
| 4.4 | 2 |
| 39 | 61 |
| 76 | 160 |
| 10.2 | 34 |
| 120.3 | 468 |
| 1.2 | 6 |
| 10.2 | 54 |
| 359 | 2586 |
| 466 | 6220 |
|---------+----------|
The data is now in a form that we can work with. Those were quite a few steps to get
from a Wikipedia page to a CSV data set. However, when you combine all of these
commands into one, you will see that it's actually really concise and expressive:
$
curl -sL
'http://en.wikipedia.org/wiki/List_of_countries'
\
>
'_and_territories_by_border/area_ratio'
|
>
scrape -be
'table.wikitable > tr:not(:first-child)'
|
>
xml2json | jq -c
'.html.body.tr[] | {country: .td[1][],'
\
>
'border: .td[2][], surface: .td[3][], ratio: .td[4][]}'
|
>
json2csv -p -k
=
border,surface | head -n 11 | csvlook
|---------+----------|