In this chapter, you'll learn how to:
• Formalize your data workflow steps in terms of input and output dependencies
• Run specific steps of your workflow from the command line
• Use inline code (e.g., Python and R)
• Store and retrieve data from external sources (e.g., S3 and HDFS)
Overview
Managing your data workflow with Drake is the main topic of this chapter. As such,
you'll learn about:
• Defining your workflow with a so-called Drakefile
• Thinking about your workflow in terms of input and output dependencies
• Building specific targets
Introducing Drake
Drake organizes command execution around data and its dependencies. Your data
processing steps are formalized in a separate text file (a workflow). Each step usually
has one or more inputs and outputs. Drake automatically resolves their dependencies
and determines which commands need to be run and in which order.
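To make this concrete, here is a minimal sketch of such a workflow file. The filenames and commands are illustrative, not from the original text; by default, Drake looks for a file named Drakefile in the current directory.

```
; A hypothetical two-step workflow. Each step declares its output,
; a left arrow, and its input, followed by one or more indented
; shell commands. $INPUT and $OUTPUT refer to the declared files.
data/users.csv <- data/users.json
    json2csv < $INPUT > $OUTPUT

data/top-users.csv <- data/users.csv
    sort -t, -k2 -rn $INPUT | head -n 10 > $OUTPUT
```

Because the second step takes the first step's output as its input, Drake knows it must run the conversion before the sort, and it will skip either step when its output is already newer than its input.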
This means that when you have, say, an SQL query that takes 10 minutes, it only has to be executed when the result is missing or when the query has changed since it was last run. Also, if you want to (re-)run a specific step, Drake only considers (re-)running the steps on which it depends. This can save you a lot of time.
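For example, assuming a workflow that defines a target named data/top-users.csv (a hypothetical name), you could build just that target from the command line; Drake then only considers the steps that target depends on:

```
$ drake                      # build every out-of-date target in ./Drakefile
$ drake data/top-users.csv   # build only this target and its dependencies
```

Before running anything, Drake shows the steps it has determined need to run, so you can confirm the plan first.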
A formalized workflow also lets you easily pick up your project after a few weeks and collaborate with others. We strongly advise you to formalize your workflow, even when you think it will be a one-off project, because you never know when you'll need to run certain steps again, or when you'll want to reuse certain steps in another project.
Installing Drake
Drake has quite a few dependencies, which makes its installation process rather
involved. For the following instructions, we assume that you are on Ubuntu.
If you're using the Data Science Toolbox, then you already have
Drake installed, and you may safely skip this section.