zcat 3.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 4.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 5.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 6.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 7.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 8.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 9.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
This long command breaks down as follows:
Print the list of files using ls and pipe it into parallel.
Transmit the jq binary to each remote machine. (Luckily, jq has no dependencies.) This file will be removed from the remote machine at the end because we specified the --trc option (which implies the --cleanup option).
The --trc {.}.csv option is short for --transfer --return {.}.csv --cleanup. (The replacement string {.} gets replaced with the input filename without the last extension.) Here, this means that the JSON file gets transferred to the remote machine, the CSV file gets returned to the local machine, and both files will be removed from the remote machine after each job.
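The effect of {.} on a name such as 1.json.gz can be mimicked in plain shell, which may help explain why the returned file is called 1.json.csv rather than 1.csv. This is only an analogy using shell parameter expansion, not parallel itself, and the function name strip_last_ext is made up for illustration.

```shell
# Hypothetical helper mimicking parallel's {.} replacement string:
# drop only the last extension from a filename.
strip_last_ext() {
  printf '%s\n' "${1%.*}"
}

# Show which CSV name each compressed JSON file would map to.
for f in 1.json.gz 2.json.gz; do
  echo "$f -> $(strip_last_ext "$f").csv"
done
```

Only the trailing .gz is removed, so the .json part survives in the name of the CSV file.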
Specify a list of hostnames. Remember, if you want to try this out locally, you can specify --sshlogin : instead of --slf instances.
Note the escaping in the awk expression. Quoting can sometimes be tricky. Here, the dollar signs and the double quotes are escaped. If quoting ever gets too confusing, remember that you can turn the pipeline into a separate command-line tool, just as we did with sum.
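As a rough sketch of both points, the snippet below first shows what the escaped awk expression looks like once the shell has processed the surrounding double quotes, and then wraps the tail of the pipeline in a helper so that no escaping is needed. The name count_boroughs and the sample input are assumptions made for illustration; they do not come from the original command.

```shell
# The command handed to parallel sits inside double quotes, so $ and "
# must be escaped to reach awk literally.
escaped="awk '{print \$2\",\"\$1}'"
echo "$escaped"

# Hypothetical alternative: keep the tail of the pipeline in its own
# function (or executable script), where no extra escaping is needed.
# It reads one borough name per line and prints name,count rows.
count_boroughs() {
  sort | uniq -c | awk '{print $2","$1}'
}

# Made-up input, for illustration only.
printf '%s\n' bronx queens bronx | count_boroughs
```

Saved as an executable script rather than a function, such a helper could be called by name inside the quoted command, leaving nothing in it that needs a backslash.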
If we, at some point during this command, run ls on one of the remote machines, we would see that parallel indeed transfers (and cleans up) the binary jq, the JSON files, and CSV files:
$ ssh $(head -n 1 instances) ls
1.json.csv
1.json.gz
jq
Each CSV file looks like this:
$ cat 1.json.csv
bronx,3
brooklyn,5
manhattan,24
queens,3
staten_island,2
unspecified,63