Parallel Pipelines - Data Science at the Command Line

The command line rules

If you ever wonder whether your GNU Parallel command is set up correctly, you can add the --dryrun option. Instead of actually executing the command, GNU Parallel will print out all the commands exactly as if they would have been executed.
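
As a rough illustration of the dry run described above, the sketch below previews three commands without executing them. It assumes GNU Parallel is installed and uses the --dry-run spelling of the option; the function name preview and the -k (--keep-order) option are additions for this example and are not part of the original text.

```shell
#!/usr/bin/env bash
# Sketch: preview the commands GNU Parallel would run, without running them.
# "preview" is a hypothetical helper name used only for this illustration.
preview() {
  # -k (--keep-order) prints the commands in the order of the input.
  seq 3 | parallel -k --dry-run "echo Hi {}"
}

if command -v parallel >/dev/null 2>&1; then
  preview   # prints the three echo commands rather than their output
else
  echo "GNU Parallel is not installed; nothing to preview." >&2
fi
```

Once the printed commands look right, removing --dry-run runs them for real.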

Controlling the Number of Concurrent Jobs

By default, parallel runs one job per CPU core. You can control the number of jobs that will be run in parallel with the --jobs or -j option. Simply specifying a number, say n, means that n jobs will be run in parallel. If you put a plus sign in front of the number n, then parallel will run m+n jobs, where m is the number of CPU cores. If you put a minus sign in front of the number, then parallel will run m-n jobs. You can also specify a percentage to the -j option; the default is 100% of the number of CPU cores. The optimal number of jobs to run in parallel depends on the actual commands you are running:

$ seq 5 | parallel -j0 "echo Hi {}"
Hi 1
Hi 2
Hi 3
Hi 4
Hi 5

$ seq 5 | parallel -j200% "echo Hi {}"
Hi 1
Hi 2
Hi 3
Hi 4
Hi 5
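
To make the arithmetic behind the -j forms concrete, here is a small sketch that computes the job count for a given specification and core count. The function jobs_for is a hypothetical helper written for this illustration; it only mirrors the rules stated above (n, +n, -n, and a percentage), is not how GNU Parallel itself is implemented, and ignores special cases such as -j0. The core count of 4 is an assumed value.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mirrors the -j arithmetic described in the text.
# It is NOT part of GNU Parallel and does not handle -j0.
# Usage: jobs_for SPEC CORES
jobs_for() {
  local spec=$1 cores=$2
  case $spec in
    +*) echo $(( cores + ${spec#+} )) ;;          # +n  -> m + n
    -*) echo $(( cores - ${spec#-} )) ;;          # -n  -> m - n
    *%) echo $(( cores * ${spec%\%} / 100 )) ;;   # p%  -> p percent of m
    *)  echo "$spec" ;;                           # n   -> exactly n
  esac
}

cores=4   # assumed core count for the illustration
for spec in 2 +2 -1 200% 100%; do
  printf '%-5s -> %s jobs\n' "$spec" "$(jobs_for "$spec" "$cores")"
done
```

With four cores assumed, +2 gives 6 jobs, -1 gives 3, and 200% gives 8, while 100% matches the default of one job per core.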

If you specify -j1, then the commands will be run in serial. Even though this doesn't do the name of the tool justice, it still has its uses, for example when you need to access an API that only allows one connection at a time. If you specify -j0, then parallel will run as many jobs in parallel as possible. This can be compared to looping with subshells, which is not advised.
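
For comparison, looping with subshells might look like the sketch below. The function name greet_all is hypothetical and chosen only for this example. Every iteration is sent to the background immediately, so nothing limits how many processes run at once, and the order of the output is not guaranteed.

```shell
#!/usr/bin/env bash
# Sketch: a loop that starts one background subshell per argument.
# "greet_all" is a hypothetical name used only for this illustration.
greet_all() {
  local i
  for i in "$@"; do
    ( echo "Hi $i" ) &   # each subshell runs in the background
  done
  wait                   # block until every background job has finished
}

greet_all 1 2 3 4 5      # lines may appear in any order
```

With only five short jobs this is harmless, but the same loop over thousands of inputs would start thousands of processes at once.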

Logging and Output

To save the output of each command, you might be tempted to do the following:

$ seq 5 | parallel "echo \"Hi {}\" > data/hi-{}.txt"

This will save the output into individual files. Or, if you want to save everything into one big file, you could do the following:
