Example 6-2. Drake workflow with timecheck (01.drake)
data/top-5 <- [-timecheck]
    curl -s 'http://www.gutenberg.org/browse/scores/top' |
    grep -E '^<li>' |
    head -n 5 |
    sed -E "s/.*ebooks\/([0-9]+)\">([^<]+)<.*/\\1,\\2/" > data/top-5
The square brackets indicate that this is an option to the step. The minus (-) in front
of timecheck means that we wish to disable checking timestamps. Now, this step is
only run when the output is missing.
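To make the effect of disabling timestamp checking concrete, here is a minimal Bash sketch of the same rule: run a command only when its output file does not exist yet. The function name run_if_missing is hypothetical and is not part of Drake; the sketch only mimics the behavior described above and says nothing about how Drake implements it.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of Drake): run the given command only
# when the output file is missing, mimicking a step with [-timecheck].
run_if_missing() {
    local output="$1"
    shift
    if [ -e "$output" ]; then
        # The output already exists, so the command is skipped.
        echo "Nothing to do."
    else
        # The output is missing, so run the remaining arguments as a command.
        "$@"
    fi
}
```

The first argument is the output file; everything after it is the command to run. Once the output exists, every later call skips the command, no matter how old the file is.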
Let's use different filenames so that we keep old versions. We can specify a different
workflow name (other than Drakefile) with the -w option. Let's run Drake once more:
$ mv Drakefile 01.drake
$ drake -w 01.drake
Nothing to do.
Our very first workflow is already saving us time because Drake detects that the step
does not need to be executed again. However, we can do much better than this. This
workflow has three shortcomings that we're going to address in the next section.
Well, That Depends
Our workflow contains just a single step, which means that, just like having a simple
Bash script, everything will be executed all the time. So the first thing we are going to
do is to split up this single step into two steps, where the first step downloads the
HTML, and the second step processes this HTML. The second step obviously
depends on the first step. We can define this dependency in our workflow.
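The idea behind such a dependency can be sketched in plain Bash: a step needs to run when its output is missing or when its input is newer than its output. The function name needs_run is hypothetical, and this is only a rough illustration of timestamp-based dependency checking, not Drake's actual implementation.

```bash
#!/usr/bin/env bash

# Hypothetical helper (not part of Drake): return success (exit code 0)
# when the step that turns input "$1" into output "$2" needs to run.
needs_run() {
    local input="$1" output="$2"
    # Run when the output is missing or the input is newer than the output.
    [[ ! -e "$output" || "$input" -nt "$output" ]]
}
```

With two steps, this check would be made once per step: the processing step is skipped as long as the downloaded HTML has not changed since the output was last written.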
You may have noticed that the number 5 is specified three times. If we ever wanted
to get, say, the top 10 ebooks from Project Gutenberg, we would have to change
our workflow in three places. This is inefficient and needs to be addressed. Luckily,
Drake supports variables.
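The benefit of a variable is easy to see in plain Bash, where the same duplication would occur. In the sketch below, the number is passed once and reused for both the head command and the output filename. The function name top_n is hypothetical, the input is read from standard input rather than downloaded, and the syntax shown is Bash, not Drake's own variable syntax.

```bash
#!/usr/bin/env bash

# Hypothetical helper: keep the first N lines that start with <li> from
# standard input and write them to data/top-N. The number N appears once,
# as the first argument, instead of being repeated in several places.
top_n() {
    local num="$1"
    local output="data/top-${num}"
    mkdir -p data
    grep -E '^<li>' | head -n "$num" > "$output"
    # Print the name of the file that was written.
    echo "$output"
}
```

Changing from the top 5 to the top 10 now means changing a single argument, for example by piping the downloaded HTML into top_n 10.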
It may not be immediately obvious from our workflow, but our data resides in the
same location as the script. It's better to keep the data in a separate location, apart
from any code that generates it. Not only does this keep our project cleaner, it also
allows us to delete the generated data files more easily, and we can easily specify
that we do not want the data files to be included in any version control
system such as git (Torvalds & Hamano, 2014). Let's have a look at our improved
workflow in Example 6-3.