result, parallel cannot determine the number of cores and will default to using one
CPU core. When you receive this warning message, you can do one of the following
four things:
• Don't worry, and be happy with using one CPU core per machine.
• Specify the number of jobs per machine via the -j option.
• Specify the number of cores to use per machine by prefixing each hostname in the instances file with, for example, 2/ if you want two cores.
• Install GNU Parallel using a package manager (note that this is usually not the latest version). For example, on Ubuntu:
$ parallel --nonall --slf instances "sudo apt-get install -y parallel"
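To illustrate the core-prefix option above, a hypothetical instances file that asks for two cores on each of two machines could look like this (the hostnames are made up):

```
2/ec2-host-1.example.com
2/ec2-host-2.example.com
```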
Distributing Local Data Among Remote Machines
The second flavor of distributed processing is to distribute local data directly among
remote machines. Imagine you have one very large data set that you want to process
using multiple remote machines. For simplicity, we're going to sum all integers from 1
to 1,000. First, let's verify that our input is actually being distributed by printing the
hostname of the remote machine and the length of the input it received using wc :
$ seq 1000 | parallel -N100 --pipe --slf hosts "(hostname; wc -l) | paste -sd:"
ip-172-31-23-204:100
ip-172-31-23-205:100
ip-172-31-23-205:100
ip-172-31-23-204:100
ip-172-31-23-205:100
ip-172-31-23-204:100
ip-172-31-23-205:100
ip-172-31-23-204:100
ip-172-31-23-205:100
ip-172-31-23-204:100
We can verify that our 1,000 numbers get distributed evenly in subsets of 100 (as
specified by -N100 ). Now, we're ready to sum all those numbers:
$ seq 1000 | parallel -N100 --pipe --slf hosts "paste -sd+ | bc" |
> paste -sd+ | bc
500500
Here, we immediately sum the 10 partial sums we get back from the remote machines. Let's double-check that the answer is correct:
$ seq 1000 | paste -sd+ | bc
500500
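As a final sanity check that doesn't involve parallel or remote machines at all, Gauss's formula n(n+1)/2 gives the same result using only shell arithmetic:

```shell
$ echo $(( 1000 * 1001 / 2 ))
500500
```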