We can sum the counts in each CSV file using Rio and the aggregate function in R:
$ cat *.csv | header -a borough,count |
> Rio -e 'aggregate(count ~ borough, df, sum)' |
> csvsort -rc count | csvlook
|----------------+--------|
|  borough       | count  |
|----------------+--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------+--------|
Or, if you prefer to use SQL to aggregate results, you can use csvsql as discussed in Chapter 5:
$ cat *.csv | header -a borough,count |
> csvsql --query 'SELECT borough, SUM(count) AS count FROM stdin '\
> 'GROUP BY borough ORDER BY count DESC' | csvlook
|----------------+--------|
|  borough       | count  |
|----------------+--------|
|  unspecified   | 467    |
|  manhattan     | 274    |
|  brooklyn      | 103    |
|  queens        | 77     |
|  bronx         | 44     |
|  staten_island | 35     |
|----------------+--------|
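Both pipelines above compute a grouped sum. Where neither Rio nor csvkit is at hand, roughly the same aggregation can be sketched with awk and sort alone. This is only an illustrative alternative, not the approach used in this chapter. It assumes headerless input files with exactly two comma-separated fields and no quoted commas. The function name sum_by_borough, the file names, and the sample values are all hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical sample data standing in for the per-chunk CSV files
# (headerless "borough,count" rows); the values are made up.
tmpdir=$(mktemp -d)
printf 'manhattan,3\nbrooklyn,1\n' > "$tmpdir/part1.csv"
printf 'manhattan,2\nqueens,4\nbrooklyn,1\n' > "$tmpdir/part2.csv"

# Sum the second field per distinct first field across all given files,
# then sort by the summed count in descending numeric order
# (ties are broken alphabetically by borough).
sum_by_borough() {
  awk -F, '{ total[$1] += $2 }
           END { for (b in total) print b "," total[b] }' "$@" |
    sort -t, -k2,2nr -k1,1
}

result=$(sum_by_borough "$tmpdir"/*.csv)

# Print a header line followed by the aggregated rows.
echo "borough,count"
echo "$result"

rm -rf "$tmpdir"
```

The output is plain CSV, so it could still be piped into csvlook for display if csvkit is installed. Unlike csvsql, this sketch does no CSV parsing beyond splitting on commas, so quoted fields would need a real CSV tool.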
Discussion
As data scientists, we work with data, and sometimes a lot of data. This means that we
often need to run a command multiple times or distribute data-intensive commands
over multiple CPU cores or machines. This chapter has shown how easy it is to parallelize
commands. GNU Parallel is a very powerful and flexible tool to speed up ordinary
command-line tools and distribute them over multiple cores and remote
machines. It offers a lot of functionality, and in this chapter we've only been able to
scratch the surface. Some features of GNU Parallel that we haven't covered include: