Drip speeds up Java because it reserves an instance of the JVM after
it has been run once. Because of this, you will only notice the speedup
from the second run onwards.
Obtain Top Ebooks from Project Gutenberg
For the remainder of this chapter, we'll use the following task as a running example.
Our goal is to turn the command that we use to solve this task into a Drake workflow.
We start out simple, and work our way towards an advanced workflow in order to
explain to you the various concepts and syntax of Drake.
Project Gutenberg is an ambitious project that, since 1971, has archived and digitized
over 42,000 books that are available online for free. On its website you can find the
top 100 most downloaded ebooks. Let's assume that we're interested in the top 5
downloads of Project Gutenberg. Because this list is available in HTML (and formatted
in such a way that we don't need the scrape tool), it's straightforward to obtain the top 5
downloads:
$ cd ~/book/ch06
$ curl -s 'http://www.gutenberg.org/browse/scores/top' |
> grep -E '^<li>' |
> head -n 5 |
> sed -E "s/.*ebooks\/([0-9]+).*/\\1/" > data/top-5
This command:
Downloads the HTML.
Extracts the list items.
Keeps only the top 5 items.
Extracts the ebook IDs and saves them to data/top-5.
The output of the command is:
$ cat data/top-5
1342
76
11
1661
1952
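To see what each filtering stage contributes without downloading anything, here is a minimal sketch that runs the same grep, head, and sed stages on a small inline HTML fragment. The fragment and the function name extract_ids are made up for illustration; the markup on the real page may differ.

```shell
# Hypothetical sample that mimics list items on the "top" page.
# The titles are placeholders; only the /ebooks/<id> links matter here.
sample_html='<h2>Example heading</h2>
<ol>
<li><a href="/ebooks/1342">Example title A</a></li>
<li><a href="/ebooks/76">Example title B</a></li>
<li><a href="/ebooks/11">Example title C</a></li>
</ol>'

# extract_ids N: read HTML on stdin and print the first N ebook IDs.
extract_ids() {
  grep -E '^<li>' |                       # keep only lines that start a list item
    head -n "$1" |                        # keep the first N of those lines
    sed -E 's/.*ebooks\/([0-9]+).*/\1/'   # reduce each line to the numeric ID
}

# Keep the top 2 IDs from the sample and print them.
top2=$(printf '%s\n' "$sample_html" | extract_ids 2)
printf '%s\n' "$top2"
```

Wrapping the stages in a function makes the number of items a parameter, so the same sketch can be fed the output of curl instead of the sample.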
If you want to be able to reproduce this at a later time, the easiest thing you can do is
put this command in a script, as we saw in Chapter 4. If you execute this script again,
the HTML will be downloaded again as well. There are three common reasons why
you might want to be able to control whether certain steps are run. First, because a