Database Reference
In-Depth Information
{
"td": [
The reason we convert the HTML to JSON is because there is a very powerful tool
called jq that operates on JSON data. The following command extracts certain parts
of the JSON data and reshapes it into a form that we can work with:
$ < data/table.json jq -c '.html.body.tr[] | {country: .td[1][],border:' \
> '.td[2][], surface: .td[3][]}' > countries.json
$ head -n 10 data/countries.json
{"surface":"0.44","border":"3.2","country":"Vatican City"}
{"surface":"2","border":"4.4","country":"Monaco"}
{"surface":"61","border":"39","country":"San Marino"}
{"surface":"160","border":"76","country":"Liechtenstein"}
{"surface":"34","border":"10.2","country":"Sint Maarten (Netherlands)"}
{"surface":"468","border":"120.3","country":"Andorra"}
{"surface":"6","border":"1.2","country":"Gibraltar (United Kingdom)"}
{"surface":"54","border":"10.2","country":"Saint Martin (France)"}
{"surface":"2586","border":"359","country":"Luxembourg"}
{"surface":"6220","border":"466","country":"Palestinian territories"}
Now we're getting somewhere. JSON is a very popular data format with many advanā€
tages, but for our purposes we're better off having the data in CSV format. The tool
json2csv is able to convert the data from JSON to CSV:
$ < countries.json json2csv -p -k border,surface > countries.csv
$ head -n 11 countries.csv | csvlook
|---------+----------|
| border | surface |
|---------+----------|
| 3.2 | 0.44 |
| 4.4 | 2 |
| 39 | 61 |
| 76 | 160 |
| 10.2 | 34 |
| 120.3 | 468 |
| 1.2 | 6 |
| 10.2 | 54 |
| 359 | 2586 |
| 466 | 6220 |
|---------+----------|
The data is now in a form that we can work with. Those were quite a few steps to get
from a Wikipedia page to a CSV data set. However, when you combine all of these
commands into one, you will see that it's actually really concise and expressive:
$ curl -sL 'http://en.wikipedia.org/wiki/List_of_countries' \
> '_and_territories_by_border/area_ratio' |
> scrape -be 'table.wikitable > tr:not(:first-child)' |
> xml2json | jq -c '.html.body.tr[] | {country: .td[1][],' \
> 'border: .td[2][], surface: .td[3][], ratio: .td[4][]}' |
> json2csv -p -k = border,surface | head -n 11 | csvlook
|---------+----------|
Search WWH ::




Custom Search