In this chapter, you'll learn how to:
• Formalize your data workflow steps in terms of input and output dependencies
• Run specific steps of your workflow from the command line
• Use inline code (e.g., Python and R)
• Store and retrieve data from external sources (e.g., S3 and HDFS)
Overview
Managing your data workflow with Drake is the main topic of this chapter. As such,
you'll learn about:
• Defining your workflow with a so-called Drakefile
• Thinking about your workflow in terms of input and output dependencies
• Building specific targets
Introducing Drake
Drake organizes command execution around data and its dependencies. Your data
processing steps are formalized in a separate text file (a workflow). Each step usually
has one or more inputs and outputs. Drake automatically resolves their dependencies
and determines which commands need to be run and in which order.
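To make this concrete, here is a minimal sketch of such a workflow file. The filenames and commands are illustrative, not from the original text; by default, Drake looks for a file named Drakefile in the current directory.

```
; A hypothetical two-step workflow. Each step declares its output,
; a left arrow, and its input, followed by one or more indented
; shell commands. $INPUT and $OUTPUT refer to the declared files.
data/users.csv <- data/users.json
    json2csv < $INPUT > $OUTPUT

data/top-users.csv <- data/users.csv
    sort -t, -k2 -rn $INPUT | head -n 10 > $OUTPUT
```

Because the second step takes the first step's output as its input, Drake knows it must run the conversion before the sort, and it will skip either step when its output is already newer than its input.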
This means that when you have, say, an SQL query that takes 10 minutes, it only has to be executed when the result is missing or when the query has changed since it was last run. Also, if you want to (re-)run a specific step, Drake only considers (re-)running the steps on which it depends. This can save you a lot of time.
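For example, assuming a workflow that defines a target named data/top-users.csv (a hypothetical name), you could build just that target from the command line; Drake then only considers the steps that target depends on:

```
$ drake                      # build every out-of-date target in ./Drakefile
$ drake data/top-users.csv   # build only this target and its dependencies
```

Before running anything, Drake shows the steps it has determined need to run, so you can confirm the plan first.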
A formalized workflow also lets you easily pick up your project after a few weeks and collaborate with others. We strongly advise you to formalize your workflow, even when you think it will be a one-off project, because you never know when you'll need to run certain steps again, or when you'll want to reuse certain steps in another project.
Installing Drake
Drake has quite a few dependencies, which makes its installation process rather
involved. For the following instructions, we assume that you are on Ubuntu.
If you're using the Data Science Toolbox, then you already have
Drake installed, and you may safely skip this section.