--- 1: data/top-5 <- data/top.html -> done in 0.02s
Done (2 steps run).
Now, let's assume that instead of the top 5 ebooks, we want the top 10. We can
set the NUM variable from the command line and run Drake ( Example 6-4 ); because
the workflow defines the variable as NUM:=5, the value supplied on the command line
takes precedence.
Example 6-4. Drake workflow with NUM=10 (02.drake)
$ NUM=10 drake -w 02.drake
The following steps will be run, in order:
1: data/top-10 <- data/top.html [missing output]
Confirm? [y/n] y
Running 1 steps with concurrence of 1...
--- 1. Running (missing output): data/top-10 <- data/top.html
--- 1: data/top-10 <- data/top.html -> done in 0.02s
Done (1 steps run).
As you can see, Drake now only needs to execute the second step, because the output
of the first step, data/top.html, already exists. Again, downloading an HTML file is not
such a big deal, but can you imagine the implications if you were dealing with 10 GB
of data?
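To convince yourself that both runs produced their outputs, you can list the data
directory (an illustrative check; the filenames follow from BASE=data/ in the
workflow, and your listing may differ):
$ ls data/
top-10  top-5  top.html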
Rebuilding Specific Targets
The list of the top 100 ebooks on Project Gutenberg changes daily. We've seen that if
we run the Drake workflow again, the HTML file containing this list is not downloaded
again. Luckily, Drake allows us to run certain steps again so that we can update this
HTML file:
$ drake -w 02.drake '=top.html'
There is a more convenient way than using the output filename to specify which step
you want to execute again. We can add so-called tags to both the inputs and outputs of
steps. A tag starts with a %. It's a good idea to choose a short and descriptive tag name
so that you can easily specify it on the command line. Let's add the tag %html to the
first step and %filter to the second step, as in Example 6-5 .
Example 6-5. Drake workflow with tags (03.drake)
NUM:=5
BASE=data/

top.html, %html <- [-timecheck]
    curl -s 'http://www.gutenberg.org/browse/scores/top' > $OUTPUT

top-$[NUM], %filter <- top.html
    < $INPUT grep -E '^<li>' |
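With these tags in place, you can presumably re-run the download step by referring to
its tag rather than its output filename, following the same target pattern used above
with '=top.html' (a sketch; consult Drake's documentation if your version expects a
different target syntax):
$ drake -w 03.drake '=%html'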