Creating Reusable Command-Line Tools - Data Science at the Command Line

There are three main reasons for creating command-line tools in a programming language instead of Bash. First, you may have existing code that you wish to be able to use from the command line. Second, the command-line tool would end up encompassing more than a hundred lines of code. Third, the command-line tool needs to be very fast.

The six steps in the previous section roughly apply to creating command-line tools in other programming languages as well. The first step, however, would not be copying and pasting from the command line, but rather copying and pasting the relevant code into a new file. Command-line tools in Python and R need to specify python (Python Software Foundation, 2014) and Rscript (R Foundation for Statistical Computing, 2014), respectively, as the interpreter after the shebang.

When it comes to creating command-line tools using Python and R, there are two more aspects that deserve special attention, which will be discussed next. First, processing standard input, which comes naturally to shell scripts, has to be taken care of explicitly in Python and R. Second, as command-line tools written in Python and R tend to be more complex, we may also want to offer the user the ability to specify more complex command-line arguments.
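Both aspects can be sketched together in Python. The following is a minimal illustration, not code from the book: the function names `build_parser` and `run`, the positional `word` argument, and the `--ignore-case` flag are all hypothetical, chosen only to show explicit stream handling alongside argument parsing with the standard `argparse` module.

```python
import argparse


def build_parser():
    # Hypothetical interface, invented for this illustration.
    parser = argparse.ArgumentParser(
        description="Print lines from an input stream that contain a word.")
    parser.add_argument("word", help="text to look for in each line")
    parser.add_argument("-i", "--ignore-case", action="store_true",
                        help="match regardless of letter case")
    return parser


def run(argv, stream_in, stream_out):
    """Filter stream_in line by line and write matching lines to stream_out."""
    args = build_parser().parse_args(argv)
    needle = args.word.lower() if args.ignore_case else args.word
    for line in stream_in:  # iterating a text stream yields one line at a time
        haystack = line.lower() if args.ignore_case else line
        if needle in haystack:
            stream_out.write(line)
```

In an actual tool, the file would start with the shebang line and end with a call such as `run(sys.argv[1:], sys.stdin, sys.stdout)` under an `if __name__ == "__main__":` guard. Passing the streams in as parameters is a design choice that keeps the logic easy to test without a terminal.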

Porting the Shell Script

As a starting point, let's see how we would port the prior shell script to both Python and R. In other words, what Python and R code gives us the most frequently used words from standard input? It is not important whether implementing this task in anything other than a shell programming language is a good idea. What matters is that it gives us a good opportunity to compare Bash with Python and R.

We will first show the two files top-words.py and top-words.R and then discuss the differences with the shell code. In Python, the code would look something like Example 4-5.

Example 4-5. ~/book/ch04/top-words.py

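#!/usr/bin/env python
import re
import sys
from collections import Counter

# Number of words to print, given as the first command-line argument
num_words = int(sys.argv[1])

# Read all of standard input and normalize it to lowercase
text = sys.stdin.read().lower()

# Split on runs of non-word characters and count each word
words = re.split(r'\W+', text)
cnt = Counter(words)

for word, count in cnt.most_common(num_words):
    print("%7d %s" % (count, word))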
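One detail of this listing is worth noting: `re.split` returns an empty string whenever the text begins or ends with non-word characters (a trailing newline, for instance), and that empty string is then counted like any other word. The sketch below is one possible variant, not the book's code. It uses `re.findall` to collect only the word matches and wraps the logic in two hypothetical helper functions, `top_words` and `format_counts`, so it can be tried without piping in any input.

```python
import re
from collections import Counter


def top_words(text, num_words):
    """Return the num_words most frequent words in text as (word, count) pairs."""
    # findall returns only the matches, so no empty strings are produced
    words = re.findall(r"\w+", text.lower())
    return Counter(words).most_common(num_words)


def format_counts(pairs):
    """Format (word, count) pairs in the right-aligned layout of the listing."""
    return ["%7d %s" % (count, word) for word, count in pairs]
```

For example, `top_words("The cat and the hat.\n", 1)` returns `[("the", 2)]`, whereas splitting the same text with `re.split(r"\W+", ...)` yields a list that ends in an empty string.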
