There are two problems with this naive approach. First, there's no way to control how
many processes you are running concurrently. The second issue is logging: it is unclear
which output belongs to which input.
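Both problems can be seen in a minimal sketch of such a loop. This is an illustration, not the exact listing discussed above: it assumes a POSIX shell with seq available, uses shell arithmetic instead of bc to stay self-contained, and the variable name naive_output is made up for the example.

```shell
# Naive approach: start every job in the background at once.
# Nothing limits how many jobs run concurrently, and the lines
# arrive in completion order, not input order.
naive_output=$(
  for i in $(seq 5); do
    ( echo "$((i * i))" ) &
  done
  wait   # block until all background jobs have finished
)
echo "$naive_output"
```

The five squares are all printed, but their order may differ from run to run, and nothing in the output says which input produced which line.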
Introducing GNU Parallel
GNU Parallel (Tange, 2014) is a command-line tool that allows us to parallelize
commands and pipelines. The beauty of this tool is that existing tools can be used as they
are; they do not need to be modified. Before we go into the details of GNU Parallel,
here's a little teaser to show you how easy it is to parallelize the for loop stated above:
$ seq 5 | parallel "echo {}^2 | bc"
1
4
9
16
25
This is parallel in its simplest form: without any options. As you can see, it basically
acts as a for loop. (We'll explain later what is going on exactly.) With no fewer than 110
options (!), GNU Parallel offers a lot of additional functionality. Don't worry, by the
end of this chapter, you'll have a solid understanding of the most important ones.
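As a preview of how a few of those options address the two problems of the naive approach, here is a hedged sketch of the same teaser with three options added. It assumes GNU Parallel and bc are installed and skips itself otherwise; the variable name squares is made up for the example.

```shell
# Run the sketch only if the installed parallel is GNU Parallel and bc exists.
if parallel --version 2>/dev/null | head -n 1 | grep -q '^GNU parallel' \
   && command -v bc >/dev/null 2>&1; then
  # -j 2          run at most two jobs at the same time
  # --keep-order  print results in the order of the input
  # --tag         prefix each output line with the input that produced it
  squares=$(seq 5 | parallel -j 2 --keep-order --tag "echo {}^2 | bc")
  echo "$squares"
else
  echo "GNU Parallel or bc not found; skipping" >&2
fi
```

With --tag, each output line starts with its input value followed by a tab, so it is clear which output belongs to which input.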
Install GNU Parallel by running the following commands:
$ wget http://ftp.gnu.org/gnu/parallel/parallel-latest.tar.bz2
$ tar -xvjf parallel-latest.tar.bz2 > extracted-files
$ cd $( head -n 1 extracted-files )
$ ./configure && make && sudo make install
You may have noticed that we keep writing GNU Parallel. That's
because there are two tools with the name “parallel.” If you make
use of the Data Science Toolbox, you already have the correct one
installed. Otherwise, double-check that you have the correct tool
installed by running parallel --version .
You can verify that you have correctly installed GNU Parallel:
$ parallel --version | head -n 1
GNU parallel 20140622
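Because two different tools share the name, this check can also be scripted. The following is a minimal sketch; the helper name is_gnu_parallel is hypothetical, and it assumes only that the first line of the version output starts with "GNU parallel", as in the output shown above.

```shell
# Hypothetical helper: succeeds only if the given version line
# looks like the first line printed by GNU Parallel.
is_gnu_parallel() {
  case "$1" in
    "GNU parallel "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Take the first line of the version output; it stays empty if
# parallel is missing or does not understand --version.
first_line=$(parallel --version 2>/dev/null | head -n 1)
if is_gnu_parallel "$first_line"; then
  echo "OK: $first_line"
else
  echo "parallel is missing or is not GNU Parallel" >&2
fi
```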
To delete the created files and directories, run the following:
$ cd ..
$ rm -r $( head -n 1 extracted-files )
$ rm parallel-latest.tar.bz2 extracted-files