Parallel Pipelines - Data Science at the Command Line


We can sum the counts in each CSV file using Rio and the aggregate function in R:

$ cat *.csv | header -a borough,count |
> Rio -e 'aggregate(count ~ borough, df, sum)' |
> csvsort -rc count | csvlook
|----------------+--------|
|  borough       | count  |
|----------------+--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------+--------|

Or, if you prefer to use SQL to aggregate results, you can use csvsql as discussed in Chapter 5:

$ cat *.csv | header -a borough,count |
> csvsql --query 'SELECT borough, SUM(count) AS count FROM stdin '\
> 'GROUP BY borough ORDER BY count DESC' | csvlook
|----------------+--------|
|  borough       | count  |
|----------------+--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------+--------|
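If neither Rio nor csvkit is installed, the same per-borough sum can be approximated with standard tools. The following is a minimal sketch, not the approach used in this chapter: it swaps in awk and sort, the function name sum_counts and the sample rows are hypothetical, and it assumes headerless borough,count lines with no quoted fields or embedded commas.

```shell
#!/bin/sh
# sum_counts (hypothetical helper): read headerless "borough,count" lines
# on stdin and print one "borough,total" line per borough, largest first.
# Assumption: simple CSV, no quoted fields or embedded commas.
sum_counts() {
  awk -F, '
    NF == 2 { total[$1] += $2 }                      # accumulate per borough
    END { for (b in total) print b "," total[b] }    # emit totals, any order
  ' | sort -t, -k2,2nr -k1,1                         # count desc, then name
}

# Made-up sample rows standing in for the output of `cat *.csv`
printf '%s\n' 'manhattan,3' 'bronx,1' 'manhattan,2' 'queens,4' | sum_counts
```

With real data, `cat *.csv | sum_counts` would take the place of the printf line. The output has no header row, so add one (for example with header -a borough,count) before passing it to csvlook.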

Discussion

As data scientists, we work with data, and sometimes a lot of data. This means that we
often need to run a command multiple times or distribute data-intensive commands
over multiple CPU cores or machines. This chapter has shown how easy it is to
parallelize commands. GNU Parallel is a very powerful and flexible tool to speed up
ordinary command-line tools and distribute them over multiple cores and remote
machines. It offers a lot of functionality, and in this chapter we've only been able to
scratch the surface. Some features of GNU Parallel that we haven't covered include:
