Downloading the ebook using curl.
Converting the entire text to lowercase using tr (Meyering, 2012).
Extracting all the words using grep (Meyering, 2012) and putting each word on a separate line.
Sorting these words in alphabetical order using sort (Haertel & Eggert, 2012).
Removing all the duplicates and counting how often each word appears in the list using uniq (Stallman & MacKenzie, 2012).
Sorting this list of unique words by their count in descending order using sort.
Keeping only the top 10 lines (i.e., words) using head.
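The steps above chain together into a single pipeline. A minimal sketch follows; the function name top_words and the sample input are illustrative, and in practice the first step would pipe the output of curl into it:

```bash
# Sketch of the pipeline described above. The function name top_words and
# the sample input are assumptions for illustration; the real pipeline
# would receive the ebook's text from curl instead.
top_words () {
  tr '[:upper:]' '[:lower:]' |  # convert the entire text to lowercase
  grep -oE '[a-z]+' |           # extract the words, one per line
  sort |                        # sort the words alphabetically
  uniq -c |                     # remove duplicates, prefixing each word with its count
  sort -nr |                    # sort by count, descending
  head -n 10                    # keep only the top 10
}

# Try it on a short sample instead of a full ebook:
printf 'the cat and the dog and the bird\n' | top_words
```

Note that sort appears twice: uniq only collapses adjacent duplicate lines, so the words must be sorted before counting, and the counted list must be sorted again, numerically this time, to rank the words.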
Each command-line tool used in this one-liner comes with a man page.
So, if you would like to know more about, say, grep, you can
run man grep from the command line. The command-line tools
tr, grep, uniq, and sort will be discussed in more detail in the
next chapter.
There is nothing wrong with running this one-liner just once. However, imagine if we
wanted to find the top 10 words of every ebook on Project Gutenberg. Or imagine
that we wanted the top 10 words of a news website on an hourly basis. In those cases,
it would be best to have this one-liner as a separate building block that can be part of
something bigger. We want to add some flexibility to this one-liner in terms of
parameters, so we'll turn it into a shell script.
Since we use Bash as our shell, the script will be written in the programming language
Bash. This allows us to take the one-liner as the starting point, and gradually improve
on it. To turn this one-liner into a reusable command-line tool, we'll walk you
through the following six steps:
1. Copy and paste the one-liner into a file.
2. Add execute permissions.
3. Define a so-called shebang.
4. Remove the fixed input part.
5. Add a parameter.
6. Optionally extend your PATH.
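Carried out end to end, these steps might look like the following sketch. The filename top-words.sh and the use of ${1:-10} for the parameter's default value are assumptions for illustration, not necessarily the exact code that results from the chapter's walkthrough:

```bash
# Sketch of the six steps above; filename and parameter handling are assumptions.
cat > top-words.sh << 'EOF'
#!/usr/bin/env bash
# Step 3: the shebang tells the kernel which interpreter runs this script.
# Step 4: the fixed curl input is gone; the script now reads from stdin.
tr '[:upper:]' '[:lower:]' |
grep -oE '[a-z]+' |
sort |
uniq -c |
sort -nr |
head -n "${1:-10}"   # Step 5: a parameter for the number of words, defaulting to 10
EOF
chmod +x top-words.sh   # Step 2: add execute permissions

# The tool now works on any input and accepts an optional parameter:
printf 'one fish two fish red fish\n' | ./top-words.sh 1
```

Step 6 would then move or copy the script to a directory on your PATH (for example ~/bin, assuming that directory is on your PATH), so that it can be invoked from anywhere without the ./ prefix.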