$ zcat *.json.gz |
> jq -r '.borough' |
> tr '[A-Z] ' '[a-z]_' |
> sort | uniq -c |
> awk '{print $2","$1}' |
> header -a borough,count |
> csvsort -rc count | csvlook
|----------------+--------|
| borough        | count  |
|----------------+--------|
| unspecified    | 467    |
| manhattan      | 274    |
| brooklyn       | 103    |
| queens         | 77     |
| bronx          | 44     |
| staten_island  | 35     |
|----------------+--------|
Because this is quite a long pipeline, and because we're using it again in a moment with parallel, it's worth reviewing:

1. Expand all compressed files using zcat.
2. For each call, extract the name of the borough using jq.
3. Convert borough names to lowercase and replace spaces with underscores (because awk splits on whitespace by default; see the sketch below).
4. Count the occurrences of each borough using sort and uniq.
5. Reverse the fields count and borough and make it comma delimited using awk.
6. Add a header using header.
7. Sort by count using csvsort (Groskopf, 2014) and print a table using csvlook.
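To make steps 2 and 3 concrete, here's a minimal sketch that runs a single record through jq and tr (the JSON object is made up for illustration; the real records contain many more fields):

$ echo '{"borough": "STATEN ISLAND"}' | jq -r '.borough' | tr '[A-Z] ' '[a-z]_'
staten_island

Without the tr step, uniq -c would produce a line such as "35 staten island", and awk would treat "island" as a third field, so the {print $2","$1} step would silently drop it.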
Imagine, for a moment, that our own machine is so slow that we simply cannot perform this pipeline locally. We can use GNU Parallel to distribute the local files among the remote machines, let them do the processing, and retrieve the results. Here, --basefile transfers the jq binary to each remote machine (which is why the command invokes it as ./jq), --slf reads the list of remote machines from the file instances, and --trc {.}.csv transfers each input file, returns the resulting CSV file, and cleans up afterwards:
$ ls *.json.gz |
> parallel -v --basefile jq \
> --trc {.}.csv \
> --slf instances \
> "zcat {} | ./jq -r '.borough' | tr '[A-Z] ' '[a-z]_' | sort | uniq -c |" \
> " awk '{print \$2\",\"\$1}' > {.}.csv"
zcat 10.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 2.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 1.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
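The file instances passed to --slf is assumed to simply list one remote machine per line; the hostnames below are made up for illustration:

$ cat instances
ec2-203-0-113-10.compute-1.amazonaws.com
ec2-203-0-113-11.compute-1.amazonaws.com

Once the per-file CSV files have been transferred back, their counts still need to be merged into a single overview. The following is a minimal sketch of how that could be done; the awk aggregation is our own addition, not part of the pipeline above:

$ cat *.csv |
> awk -F, '{total[$1] += $2} END {for (b in total) print b","total[b]}' |
> header -a borough,count |
> csvsort -rc count | csvlook

The awk step sums the per-file counts for each borough, after which the header, csvsort, and csvlook steps from the original pipeline take over.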