Parallel Pipelines - Data Science at the Command Line

Example 8-2. Parallel bc (pbc)

#!/usr/bin/env bash

parallel -C, -k -j100% "echo '$1' | bc -l"

This tool allows us to simplify the code used in the beginning of the chapter, too:

$ seq 100 | pbc '{1}^2' | tail

8281
8464
8649
8836
9025
9216
9409
9604
9801
10000

This tool works as follows. You may remember that seq 100 generates the integers 1 to 100, one per line. These lines are piped to pbc, which in turn feeds them to parallel. The expression passed to pbc (here, {1}^2) becomes part of the command that parallel runs, and parallel fills in the {1} placeholder before it sends the expression to bc. This means that {1} is replaced by the value of the first column (there is only one column here) of the line in question.
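The substitution step can be sketched in plain Bash. This is a serial illustration only, not how GNU Parallel is implemented; the function name expand_template is hypothetical, and instead of running bc it prints the command that would be run for each input line, assuming comma-separated columns as in pbc:

```shell
#!/usr/bin/env bash

# Hypothetical helper: show the command that would be run for each input line.
# Mimics only the {1} placeholder; it does not run anything in parallel.
expand_template() {
  local template=$1
  local placeholder='{1}'
  local line first
  while IFS= read -r line; do
    first=${line%%,*}   # first comma-separated column of the line
    # Replace every literal {1} in the template with that value.
    printf "echo '%s' | bc -l\n" "${template//"$placeholder"/$first}"
  done
}

seq 3 | expand_template '{1}^2'
# echo '1^2' | bc -l
# echo '2^2' | bc -l
# echo '3^2' | bc -l
```

Each printed line corresponds to one job: the placeholder has already been filled in by the time bc sees the expression.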

Distributed Processing

Sometimes you need more power than your local machine, even with all its cores, can offer. Luckily, GNU Parallel can also distribute work to remote machines, which can speed up our pipeline considerably.

What's great is that GNU Parallel does not have to be installed on the remote machine. All that's required is that you can connect to the remote machine via SSH, which is also what GNU Parallel uses to distribute our pipeline. (Having GNU Parallel installed remotely is helpful because it can then determine how many cores to employ on each remote machine; more on this later.)

First, we're going to obtain a list of running AWS EC2 instances. Don't worry if you don't have any remote machines: you can replace any occurrence of --slf instances, which tells GNU Parallel which remote machines to use, with --sshlogin :, which tells it to run on your local machine instead. This way, you can still follow along with the examples in this section.
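One way to make that substitution automatic is sketched below. The helper name host_args is hypothetical, and the sketch assumes the list of remote machines is kept in a file called instances whose name contains no spaces. It only prints the resulting command line, so it can be tried without GNU Parallel or any remote machine:

```shell
#!/usr/bin/env bash

# Hypothetical helper: pick the host options to pass to GNU Parallel.
# Uses the login file if it exists and is non-empty; otherwise falls back
# to ":" which makes GNU Parallel run jobs on the local machine.
host_args() {
  local hostfile=${1:-instances}   # assumed file name, taken from the text
  if [[ -s $hostfile ]]; then
    printf '%s\n' "--slf $hostfile"
  else
    printf '%s\n' "--sshlogin :"
  fi
}

# Show the command line that would be used with the current setup.
echo "seq 100 | parallel $(host_args) 'echo {}'"
```

To actually run such a pipeline, pass $(host_args) unquoted to parallel so that the option and its value are split into separate arguments.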

Once we know which remote machines to take over, we're going to consider three flavors of distributed processing:
