We can sum the counts per borough across all the CSV files using Rio and the aggregate function in R:
$ cat *.csv | header -a borough,count |
> Rio -e 'aggregate(count ~ borough, df, sum)' |
> csvsort -rc count | csvlook
|----------------+--------|
|  borough       | count  |
|----------------+--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------+--------|
Or, if you prefer to use SQL to aggregate results, you can use csvsql as discussed in
Chapter 5:
$ cat *.csv | header -a borough,count |
> csvsql --query 'SELECT borough, SUM(count) AS count FROM stdin '\
> 'GROUP BY borough ORDER BY count DESC' | csvlook
|----------------+--------|
|  borough       | count  |
|----------------+--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------+--------|
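If neither Rio nor csvsql is at hand, roughly the same per-borough totals can be computed with plain awk and sort. This is a minimal sketch and not part of the original text. The function name sum_by_borough and the toy input are hypothetical, and it assumes headerless borough,count lines with no quoted commas.

```shell
# Hypothetical helper: sum the second CSV field for each distinct first field.
# Assumes headerless "borough,count" lines on stdin and no quoted commas.
sum_by_borough() {
  awk -F, '{ total[$1] += $2 } END { for (b in total) print b "," total[b] }' |
    sort -t, -k2,2nr -k1,1   # highest total first, ties broken by name
}

# Toy input (made up for illustration, not the data shown above).
printf '%s\n' 'a,1' 'b,5' 'a,2' | sum_by_borough
```

Under those assumptions, cat *.csv | sum_by_borough would stand in for the aggregation step. The output has no header row and is not passed through csvlook.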
Discussion
As data scientists, we work with data, and sometimes a lot of data. This means that we
often need to run a command multiple times or distribute data-intensive commands
over multiple CPU cores or machines. This chapter has shown how easy it is to parallelize
commands. GNU Parallel is a very powerful and flexible tool to speed up ordinary
command-line tools and distribute them over multiple cores and remote
machines. It offers a lot of functionality, and in this chapter we've only been able to
scratch the surface. Some features of GNU Parallel that we haven't covered include: