Parallel Pipelines - Data Science at the Command Line

There are two problems with this naive approach. First, there's no way to control how

many processes you are running concurrently. Second, logging becomes difficult: it is

unclear which output belongs to which input.
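
To make both problems concrete, here is a minimal sketch of what such a naive approach might look like. It is an assumption about the kind of loop being discussed, not the exact command from the text; the function name naive_squares is hypothetical, and shell arithmetic stands in for whatever real work each iteration does.

```shell
# Hypothetical illustration: launch every iteration as a background job.
naive_squares() {
    for i in 1 2 3 4 5; do
        # "&" starts the job immediately, so all five run at once;
        # nothing limits how many processes are alive at the same time.
        ( echo "$(( i * i ))" ) &
    done
    # Block until every background job has finished.
    wait
}

naive_squares
```

All five results appear, but their order depends on how the jobs happen to be scheduled, and nothing in the output says which input produced which line.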

Introducing GNU Parallel

GNU Parallel (Tange, 2014) is a command-line tool that allows us to parallelize

commands and pipelines. The beauty of this tool is that existing tools can be used as they

are; they do not need to be modified. Before we go into the details of GNU Parallel,

here's a little teaser to show you how easy it is to parallelize the for loop stated above:

$ seq 5 | parallel "echo {}^2 | bc"

1

4

9

16

25

This is parallel in its simplest form: without any options. As you can see, it basically

acts as a for loop. (We'll explain later what is going on exactly.) With no fewer than 110

options(!), GNU Parallel offers a lot of additional functionality. Don't worry, by the

end of this chapter, you'll have a solid understanding of the most important ones.
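
The comparison with a for loop can be made explicit. The sketch below is a serial loop that should print the same five squares as the teaser. It is not taken from the text: the function name serial_squares is hypothetical, and it assumes shell arithmetic in place of bc so that it runs without extra tools.

```shell
# Hypothetical serial counterpart of: seq 5 | parallel "echo {}^2 | bc"
serial_squares() {
    for i in $(seq 5); do
        # Square the current input, one iteration at a time.
        echo "$(( i * i ))"
    done
}

serial_squares
```

The difference is that this loop handles one input at a time, whereas parallel can run several of these commands simultaneously.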

Install GNU Parallel by running the following commands:

$ wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2

$ tar -xvjf parallel-latest.tar.bz2 > extracted-files

$ cd $( head -n 1 extracted-files )

$ ./configure && make && sudo make install

You may have noticed that we keep writing GNU Parallel. That's

because there are two tools with the name “parallel.” If you make

use of the Data Science Toolbox you already have the correct one

installed. Otherwise, double-check that you have the correct

tool installed by running parallel --version .

You can verify that you have correctly installed GNU Parallel:

$ parallel --version | head -n 1

GNU parallel 20140622
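
Because two different tools share the name, a script may want to check which one it has before relying on it. The following is a rough sketch, assuming that only GNU Parallel answers --version with a first line beginning with "GNU parallel" (as in the output above). The function name check_parallel and its three result strings are hypothetical.

```shell
# Hypothetical helper: report which "parallel" (if any) is on the PATH.
check_parallel() {
    if ! command -v parallel >/dev/null 2>&1; then
        # No executable called "parallel" was found at all.
        echo "missing"
    elif parallel --version 2>/dev/null | head -n 1 | grep -q '^GNU parallel'; then
        # The version banner identifies GNU Parallel.
        echo "gnu"
    else
        # Some other tool named "parallel" is installed.
        echo "other"
    fi
}

check_parallel
```

A script could then refuse to continue unless the result is gnu, rather than failing later with a confusing error from the other tool.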

To delete the created files and directories, run the following:

$ cd ..

$ rm -r $( head -n 1 extracted-files )

$ rm parallel-latest.tar.bz2 extracted-files
