Parallel Pipelines - Data Science at the Command Line

CHAPTER 8

Parallel Pipelines

In the previous chapters, we've been dealing with commands and pipelines that take
care of an entire task at once. In practice, however, you may find yourself facing a task
that requires the same command or pipeline to run multiple times. For example, you
may need to:

• Scrape hundreds of web pages

• Make dozens of API calls and transform their output

• Train a classifier for a range of parameter values

• Generate scatter plots for every pair of features in your data set

Each of these examples involves some form of repetition. In your
favorite scripting or programming language, you would take care of this with a for loop or a
while loop. On the command line, the first thing you might be inclined to do is to
press <Up> (which brings back the previous command), modify the command if necessary,
and press <Enter> (which runs the command again). This is fine for two or
three repetitions, but imagine doing this for, say, dozens of files. Such an approach quickly
becomes cumbersome and inefficient. The good news is that we can write for and
while loops on the command line as well.
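
As a minimal sketch of what such loops can look like (assuming a POSIX-compatible shell such as Bash; the variable names and input values below are made up for illustration), a for loop runs a command once per listed value, and a while loop runs once per line of input:

```shell
# for loop: repeat a command once for every value in a list
squares=""
for i in 1 2 3; do
    squares="$squares$((i * i)) "
done
echo "squares: $squares"

# while loop: repeat a command once for every line read from standard input
total=0
while read -r n; do
    total=$((total + n))
done <<EOF
1
2
3
4
EOF
echo "total: $total"
```

Both loops run their iterations one after another, in serial.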

Sometimes, repeating fast commands one after another (in serial) is sufficient. But when
you have multiple cores (and perhaps even multiple machines), it would be nice if you
could make use of them, especially when you're faced with a data-intensive task.
Using multiple cores or machines may reduce the total running time
significantly. In this chapter, we'll introduce a powerful tool called GNU Parallel
that takes care of exactly this. GNU Parallel allows us to apply a command or pipeline
to a range of arguments such as numbers, lines, and files. Plus, it allows us to
run our commands in parallel.
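
To give a rough idea of the shape of such a command, here is a small sketch. It assumes GNU Parallel is installed under the name `parallel`; the `echo` command and the arguments 1, 2, 3 are placeholders chosen for illustration. When GNU Parallel is not available, the sketch falls back to an ordinary serial loop that produces the same output.

```shell
# Check that the installed `parallel` really is GNU Parallel
# (other tools with the same name use a different syntax).
if parallel --version 2>/dev/null | grep -q 'GNU parallel'; then
    # {}  is replaced by each argument listed after :::
    # -j 2 runs at most two jobs at the same time
    # -k   prints the output in the order of the input arguments
    result=$(parallel -k -j 2 echo "item {}" ::: 1 2 3)
else
    # Serial fallback: same output, one job at a time
    result=$(for i in 1 2 3; do echo "item $i"; done)
fi
echo "$result"
```

Without `-k`, the output lines may appear in whatever order the jobs happen to finish.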
