Managing Your Data Workflow - Data Science at the Command Line

CHAPTER 6

Managing Your Data Workflow

We hope that by now you have come to appreciate that the command line is a very

convenient environment for doing data science. You may have noticed that, as a

consequence of working at the command line, we:

• Invoke many different commands

• Create custom and ad-hoc command-line tools

• Obtain and generate many (intermediate) files

As this process is of an exploratory nature, our workflow tends to be rather chaotic,

which makes it difficult to keep track of what we've done. It's very important that our

steps can be reproduced, whether by ourselves or by others. When we, for example,

continue with a project from a few weeks earlier, chances are that we have forgotten

which commands we have run, on which files, in which order, and with which

parameters. Imagine the difficulty of passing on your analysis to a collaborator.

You may recover some lost commands by digging into your Bash history, but this is,

of course, not a good approach. A better approach would be to save your commands

to a Bash script, such as run.sh. This allows you and your collaborators to at least

reproduce the analysis. A shell script is, however, a suboptimal approach because:

• It's difficult to read and to maintain.

• Dependencies between steps are unclear.

• Every step gets executed every time, which is inefficient and sometimes

undesirable.
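To make these drawbacks concrete, here is a minimal sketch of what such a run.sh might look like. The file names and the three steps are hypothetical and chosen only for illustration; the first step uses printf as a stand-in for obtaining real data.

```shell
#!/usr/bin/env bash
# run.sh: a hypothetical analysis script (all file names are illustrative).
set -euo pipefail

mkdir -p data

# Step 1: obtain the raw data (printf stands in for a real download).
printf 'b\na\nb\nc\nb\na\n' > data/raw.txt

# Step 2: count how often each value occurs.
# This step silently depends on the output of step 1.
sort data/raw.txt | uniq -c | sort -rn > data/counts.txt

# Step 3: keep only the most frequent value.
# This step silently depends on the output of step 2.
head -n 1 data/counts.txt | awk '{print $2}' > data/top.txt

cat data/top.txt
```

Nothing in the script records that step 3 needs the output of step 2, which in turn needs the output of step 1; that knowledge lives only in the comments and the ordering. Running bash run.sh a second time also repeats all three steps, even when data/raw.txt has not changed.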

This is where Drake comes in handy (Factual, 2014). Drake is a command-line tool

created by Factual that allows you to:
