command-line tools. Moreover, the other formats are simply younger. Each of these
formats can nonetheless be treated as plain text, which allows us to apply the classic
command-line tools to them as well.
Sometimes we can get away with applying the classic tools to structured data. For
example, by treating the JSON data as plain text, we can change the attribute “gender”
to “sex” using sed:
$ sed -e 's/"gender":/"sex":/g' data/users.json | fold | head -n 3
{"results":[{"user":{"sex":"female","name":{"title":"mrs","first":"kaylee","last
":"anderson"},"location":{"street":"1779 washington ave","city":"cupertino","sta
te":"michigan","zip":"13931"},"email":"kaylee.anderson64@example.com","password"
Like many other command-line tools, sed does not make use of the structure of the
data; a blind substitution like this would also match the string “gender” if it happened
to appear inside a value. Because of this, it's better to use a command-line tool that
understands the structure of the data (as we will do with jq), or to first convert the
data to a tabular format such as CSV and then apply the appropriate command-line tool.
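For instance, here's one way to perform the same rename with jq instead. This is
just a sketch that assumes the nesting shown in the output above (a results array
containing user objects):

$ jq '.results[].user |=
>   with_entries(if .key == "gender" then .key = "sex" else . end)' \
>   data/users.json

Because jq operates on keys rather than on raw text, this renames only the actual
attribute and leaves any values that happen to contain the string “gender” untouched.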
Next, we're going to demonstrate converting HTML/XML and JSON to CSV through
a real-world use case. The command-line tools that we'll be using here are: curl,
scrape (Janssens, 2014), xml2json (Parmentier, 2014), jq (Dolan, 2014), and
json2csv (Czebotar, 2014).
Wikipedia holds a wealth of information. Much of this information is ordered in
tables, which can be regarded as data sets. For example, the page
http://en.wikipedia.org/wiki/List_of_countries_and_territories_by_border/area_ratio
contains a list of countries and territories together with their border length, their
area, and the ratio between the two. Let's imagine that we're interested in analyzing
this data set. In this section, we'll walk you through all the necessary steps and
their corresponding commands.
The data set that we're interested in is embedded in HTML. Our goal is to end up
with a representation of this data set that we can work with. The very first step is to
download the HTML using curl :
$ curl -sL 'http://en.wikipedia.org/wiki/List_of_countries_and_territories_' \
> 'by_border/area_ratio' > data/wiki.html
The option -s causes curl to be silent and not output anything other than the
actual HTML, and the -L option makes it follow any redirects the server sends. The
HTML is saved to a file named data/wiki.html. Here's what the first 10 lines look like:
$ head -n 10 data/wiki.html | cut -c1-79
<!DOCTYPE html>
<html lang="en" dir="ltr" class="client-nojs">
<head>
<meta charset="UTF-8" /><title>List of countries and territories by border/area
<meta http-equiv="X-UA-Compatible" content="IE=EDGE" /><meta name="generator" c
<link rel="alternate" type="application/x-wiki" title="Edit this page" href="/w