CHAPTER 6
Managing Your Data Workflow
We hope that by now you have come to appreciate that the command line is a very
convenient environment for doing data science. You may have noticed that, as a
consequence of working at the command line, we:
• Invoke many different commands
• Create custom and ad-hoc command-line tools
• Obtain and generate many (intermediate) files
As this process is of an exploratory nature, our workflow tends to be rather chaotic,
which makes it difficult to keep track of what we've done. It's very important that our
steps can be reproduced, whether by ourselves or by others. When we, for example,
continue with a project from a few weeks earlier, chances are that we have forgotten
which commands we have run, on which files, in which order, and with which
parameters. Imagine the difficulty of passing on your analysis to a collaborator.
You may recover some lost commands by digging into your Bash history, but this is
unreliable, because the history is incomplete and unordered with respect to the
analysis. A better approach is to save your commands to a Bash script, such as
run.sh. This allows you and your collaborators to at least reproduce the analysis.
A shell script is, however, a suboptimal approach because:
• It's difficult to read and to maintain.
• Dependencies between steps are unclear.
• Every step gets executed every time, which is inefficient and sometimes
undesirable.
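To make these drawbacks concrete, here is a minimal sketch of what such a run.sh might look like. The three steps, the file names, and the sample data are hypothetical and invented purely for illustration; a temporary directory is used so that the sketch does not touch any real files.

```shell
#!/usr/bin/env bash
# run.sh: a hypothetical three-step analysis (all names and data are illustrative).
set -euo pipefail

# Work in a throwaway directory so the sketch has no side effects elsewhere.
workdir=$(mktemp -d)

# Step 1: obtain the data (a stand-in for a download or database export).
printf 'name,score\nalice,3\nbob,5\ncarol,4\n' > "$workdir/data.csv"

# Step 2: drop the header and sort by score, highest first.
tail -n +2 "$workdir/data.csv" | sort -t, -k2,2nr > "$workdir/sorted.csv"

# Step 3: keep only the top-scoring row.
head -n 1 "$workdir/sorted.csv" > "$workdir/top.csv"

cat "$workdir/top.csv"   # prints: bob,5
```

Every run repeats all three steps, even when the input has not changed, and the only indication that step 3 depends on step 2 is the order in which the lines appear.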
This is where Drake comes in handy (Factual, 2014). Drake is a command-line tool
created by Factual that allows you to:
 