Parallel Pipelines - Data Science at the Command Line

Processing data/movies.csv

Processing data/top250.csv

Here's the same example, but now using parallel:

$ find data -name '*.csv' -print0 | parallel -0 echo "Processing {}"

Processing data/countries.csv

Processing data/movies.csv

Processing data/top250.csv

The -print0 option allows filenames that contain newlines or other types of

whitespace to be correctly interpreted by programs that process the output of find. If you

are absolutely certain that the filenames contain no special characters such as spaces

and newlines, then you can omit the -print0 and -0 options.
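
To make the difference concrete, here is a minimal sketch that counts the same files twice: once via plain word splitting and once via NUL-delimited output. It uses xargs -0 as a stand-in for parallel -0 so that it runs without parallel installed. The scratch directory and the filenames are hypothetical, and the sketch assumes the temporary path itself contains no whitespace.

```bash
#!/bin/bash

# Hypothetical scratch directory with one awkward filename
demo_dir=$(mktemp -d)
touch "$demo_dir/plain.csv" "$demo_dir/with space.csv"

# Unquoted command substitution splits on whitespace, so
# "with space.csv" is seen as two separate items
split_count=0
for f in $(find "$demo_dir" -name '*.csv'); do
  split_count=$((split_count + 1))
done

# NUL-delimited output keeps each filename intact;
# xargs -0 -n 1 runs echo once per filename
null_count=$(find "$demo_dir" -name '*.csv' -print0 |
  xargs -0 -n 1 echo "Processing" | wc -l | tr -d ' ')

echo "Items after word splitting: $split_count"
echo "Items with -print0 and -0:  $null_count"

rm -r "$demo_dir"
```

The first count is inflated because the space inside a filename is treated as a separator, while the second matches the number of files on disk.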

If the list to process becomes too complex, you can always store the

result to a temporary file and then use the method to loop over

lines from a file.
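
As a rough sketch of that approach, the list can be written to a temporary file and read back one line at a time. The filenames below are the ones from the example output. Here printf stands in for whatever command produces the list, and the variable names are made up for illustration.

```bash
#!/bin/bash

# Store the list in a temporary file (printf stands in for the
# command that would normally generate the list)
list_file=$(mktemp)
printf '%s\n' data/countries.csv data/movies.csv data/top250.csv > "$list_file"

# Loop over the lines; IFS= and -r keep each line exactly as written
processed=0
while IFS= read -r line; do
  echo "Processing $line"
  processed=$((processed + 1))
done < "$list_file"

rm "$list_file"
```

Redirecting the file into the loop, rather than piping into it, keeps the loop in the current shell, so variables such as the counter remain available afterward.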

Parallel Processing

Assume that we have a very long-running command, such as the one shown in

Example 8-1.

Example 8-1. ~/book/ch08/slow.sh

#!/bin/bash

echo "Starting job $1"

# Pick a pseudorandom duration of 1 to 5 seconds
duration=$(( 1 + RANDOM % 5 ))

sleep "$duration"

echo "Job $1 took ${duration} seconds"

$RANDOM is an internal Bash variable that yields a pseudorandom integer

between 0 and 32,767. Taking the remainder of the division of that number by 5

and adding 1 ensures that the number is between 1 and 5.
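
The range can be checked empirically with a short loop. This is only an illustrative sketch: the sample size of 1,000 and the variable names are arbitrary choices, not part of the book's example.

```bash
#!/bin/bash

# Track the smallest and largest value seen; start from the
# opposite ends of the expected range so any draw updates them
min=5
max=1

for i in $(seq 1 1000); do
  d=$(( 1 + RANDOM % 5 ))   # remainder is 0..4, plus 1 gives 1..5
  if [ "$d" -lt "$min" ]; then min=$d; fi
  if [ "$d" -gt "$max" ]; then max=$d; fi
done

echo "Smallest value seen: $min"
echo "Largest value seen:  $max"
```

Because the remainder of a division by 5 is always between 0 and 4, neither value can fall outside 1 to 5, however many samples are drawn.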

This process does not take up all the resources we have available. And it so happens

that we need to run this command a lot of times. For example, we need to download a

long sequence of files.

A naive way to parallelize is to run the commands in the background:

$ for i in {1..4}; do

> (./slow.sh $i; echo Processed $i) &

> done

[1] 3334

[2] 3335
