Managing Your Data Workflow - Data Science at the Command Line

This workflow is as simple as it gets. It doesn't offer any advantages over having our command in a Bash script. But don't worry, we promise you that it will get more exciting. For now, let's run Drake and see what it does with our first workflow:

$ drake
The following steps will be run, in order:
1: data/top-5 <- [missing output]
Confirm? [y/n] y
Running 1 steps with concurrence of 1...
--- 0. Running (missing output): data/top-5 <-
--- 0: data/top-5 <- -> done in 0.35s
Done (1 steps run).

Between steps, you may want to remove the file drake.log, the hidden directory .drake, and any output files to force Drake to start over.
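The cleanup described in this note can be scripted. The following is only a minimal sketch: the function name reset_workflow is made up for illustration and is not part of Drake, and the output files to delete are whatever you pass as arguments.

```shell
# Hypothetical helper (not part of Drake): remove the files named in the
# note above so that the next run of Drake starts from scratch.
reset_workflow() {
    rm -f drake.log        # the file drake.log
    rm -rf .drake          # the hidden directory .drake
    if [ "$#" -gt 0 ]; then
        rm -f -- "$@"      # any output files, e.g. data/top-5
    fi
}
```

Run from the directory that holds the workflow file, a call such as reset_workflow data/top-5 would clear the state for the workflow in this section.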

If we do not specify a workflow file, then Drake will use ./Drakefile. Drake first determines which steps need to be run. In our case, the one and only step will be run because its output is missing, meaning that there is no file named data/top-5. Drake asks for confirmation before it will execute these steps. We press <Enter>, and very soon thereafter we see that Drake is done. Drake did not complain about any errors in our steps. Let's verify that we have the top 5 topics by looking at the output file data/top-5:

$ cat data/top-5
1342
76
11
1661
1952

Now we do have the output file. Let's run Drake again:

$ drake
The following steps will be run, in order:
1: data/top-5 <- [no-input step]
Confirm? [y/n] n
Aborted.

As you can see, Drake wants to execute the step again! However, it now gives a different reason, namely, that the step has no input ([no-input step]). By default, Drake checks whether a step's input has changed by looking at the timestamp of the input. Because we didn't specify any input, Drake doesn't know whether or not this step should be run again. We can disable this default timestamp check, as shown in Example 6-2.
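To make the reasoning concrete, here is a rough shell sketch of the kind of decision described above. It is a simplified illustration and not Drake's actual implementation. The function name step_reason is hypothetical, and the labels "input newer than output" and "up to date" are our own; only "missing output" and "no-input step" appear in Drake's messages shown in this section.

```shell
# Simplified illustration only; NOT how Drake is implemented.
# Usage: step_reason OUTPUT [INPUT...]
# Prints why a step would (or would not) be scheduled to run.
step_reason() {
    output=$1
    shift
    # No output file yet: the step has to run.
    if [ ! -e "$output" ]; then
        echo "missing output"
        return 0
    fi
    # Output exists but there is no input to compare timestamps against.
    if [ "$#" -eq 0 ]; then
        echo "no-input step"
        return 0
    fi
    for input in "$@"; do
        # -nt is true when the input was modified more recently than the output.
        if [ "$input" -nt "$output" ]; then
            echo "input newer than output"
            return 0
        fi
    done
    echo "up to date"
}
```

For the workflow in this section, step_reason data/top-5 prints "missing output" before the first run and "no-input step" once the file exists, mirroring the two reasons Drake reported.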
