$ zcat *.json.gz |
> jq -r '.borough' |
> tr '[A-Z] ' '[a-z]_' |
> sort | uniq -c |
> awk '{print $2","$1}' |
> header -a borough,count |
> csvsort -rc count | csvlook
|----------------+--------|
|  borough       | count  |
|----------------+--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------+--------|
Because this is quite a long pipeline, and because we're using it again in a moment
with parallel, it's worth reviewing:
Expand all compressed files using zcat
For each call, extract the name of the borough using jq
Convert borough names to lowercase and replace spaces with underscores
(because awk splits on whitespace by default; see the short example after this list)
Count the occurrences of each borough using sort and uniq
Reverse the fields count and borough and make it comma delimited using awk
Add a header using header
Sort by count using csvsort (Groskopf, 2014) and print a table using csvlook
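To see what the tr and awk steps do in isolation, here's a quick check on a single
made-up value (the input strings below are illustrative, not taken from the data):

$ echo 'STATEN ISLAND' | tr '[A-Z] ' '[a-z]_'
staten_island
$ echo '  35 staten_island' | awk '{print $2","$1}'
staten_island,35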
Imagine, for a moment, that our own machine is so slow that we simply cannot perform
this pipeline locally. We can use GNU Parallel to distribute the local files among
the remote machines, let them do the processing, and retrieve the results:
$ ls *.json.gz |
> parallel -v --basefile jq \
> --trc {.}.csv \
> --slf instances \
> "zcat {} | ./jq -r '.borough' | tr '[A-Z] ' '[a-z]_' | sort | uniq -c |" \
> " awk '{print \$2\",\"\$1}' > {.}.csv"
zcat 10.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 2.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 1.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
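A few of these parallel options deserve a closer look: --basefile jq transfers our
local jq binary to each remote machine before any jobs run (which is why the command
invokes ./jq), --trc {.}.csv is shorthand for --transfer --return {.}.csv --cleanup
(ship each input file over, fetch back the corresponding .csv, and clean up the remote
copies afterwards), and --slf instances reads the list of remote machines from the
file instances. As a minimal sketch, that file holds one ssh login per line; the
hostnames here are placeholders for whatever machines you have set up:

$ cat instances
ubuntu@ec2-xxx-xxx-xxx-xxx.compute-1.amazonaws.com
ubuntu@ec2-yyy-yyy-yyy-yyy.compute-1.amazonaws.com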