$ zcat *.json.gz |
> jq -r '.borough' |
> tr '[A-Z] ' '[a-z]_' |
> sort | uniq -c |
> awk '{print $2","$1}' |
> header -a borough,count |
> csvsort -rc count | csvlook
|----------------+--------|
| borough        | count  |
|----------------+--------|
| unspecified    | 467    |
| manhattan      | 274    |
| brooklyn       | 103    |
| queens         | 77     |
| bronx          | 44     |
| staten_island  | 35     |
|----------------+--------|
Because this is quite a long pipeline, and because we're using it again in a moment with parallel, it's worth reviewing:

1. Expand all compressed files using zcat.
2. For each call, extract the name of the borough using jq.
3. Convert borough names to lowercase and replace spaces with underscores (because awk splits on whitespace by default; see the sketch below).
4. Count the occurrences of each borough using sort and uniq.
5. Reverse the fields count and borough and make it comma delimited using awk.
6. Add a header using header.
7. Sort by count using csvsort (Groskopf, 2014) and print a table using csvlook.
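To make steps 2 and 3 concrete, here's a minimal sketch that runs a single record through jq and tr (the JSON object is made up for illustration; the real records contain many more fields):

$ echo '{"borough": "STATEN ISLAND"}' | jq -r '.borough' | tr '[A-Z] ' '[a-z]_'
staten_island

Without the tr step, uniq -c would produce a line such as "35 staten island", and awk would treat "island" as a third field, so the {print $2","$1} step would silently drop it.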
Imagine, for a moment, that our own machine is so slow that we simply cannot perform this pipeline locally. We can use GNU Parallel to distribute the local files among the remote machines, let them do the processing, and retrieve the results. Here, --basefile transfers the jq binary to each remote machine (which is why the command invokes it as ./jq), --slf reads the list of remote machines from the file instances, and --trc {.}.csv transfers each input file, returns the resulting CSV file, and cleans up afterwards:
$ ls *.json.gz |
> parallel -v --basefile jq \
> --trc {.}.csv \
> --slf instances \
> "zcat {} | ./jq -r '.borough' | tr '[A-Z] ' '[a-z]_' | sort | uniq -c |" \
> " awk '{print \$2\",\"\$1}' > {.}.csv"
zcat 10.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 2.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
zcat 1.json.gz | ./jq -r '.borough' | sort | uniq -c | awk '{print $2","$1}'
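The file instances passed to --slf is assumed to simply list one remote machine per line; the hostnames below are made up for illustration:

$ cat instances
ec2-203-0-113-10.compute-1.amazonaws.com
ec2-203-0-113-11.compute-1.amazonaws.com

Once the per-file CSV files have been transferred back, their counts still need to be merged into a single overview. The following is a minimal sketch of how that could be done; the awk aggregation is our own addition, not part of the pipeline above:

$ cat *.csv |
> awk -F, '{total[$1] += $2} END {for (b in total) print b","total[b]}' |
> header -a borough,count |
> csvsort -rc count | csvlook

The awk step sums the per-file counts for each borough, after which the header, csvsort, and csvlook steps from the original pipeline take over.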