Creating Reusable Command-Line Tools - Data Science at the Command Line

Example 4-5 uses pure Python. When you want to do advanced text processing, we recommend you check out the NLTK package (Perkins, 2010). If you are going to work with a lot of numerical data, then we recommend you use the Pandas package (McKinney, 2012).

And in R, the code would look something like Example 4-6 (thanks to Hadley Wickham):

Example 4-6. ~/book/ch04/top-words.R

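#!/usr/bin/env Rscript
# Number of top words to print, taken from the command-line arguments
n <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
lines <- readLines(f)
# Split on runs of non-word characters and lowercase every token
words <- tolower(unlist(strsplit(lines, "\\W+")))
# Count each distinct word and sort by descending frequency
counts <- sort(table(words), decreasing = TRUE)
counts_n <- counts[1:n]
# Print each count right-aligned in 7 columns, followed by the word
cat(sprintf("%7d %s\n", counts_n, names(counts_n)), sep = "")
close(f)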
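To make the steps in the R script easier to trace, here is a minimal sketch of the same pipeline in Python. It is not the book's Example 4-5; the function names top_words and format_counts are hypothetical. It also makes one assumption the R script does not: empty tokens left over from splitting are dropped. Tied counts may come out in a different order than in R.

```python
import re
from collections import Counter


def top_words(lines, n):
    """Return the n most frequent lowercase words as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # Split on runs of non-word characters, like the "\\W+" pattern in the R script.
        tokens = re.split(r"\W+", line.lower())
        # Assumption: empty strings produced at the edges of a line are not words.
        counts.update(token for token in tokens if token)
    # most_common sorts by descending count; ties keep first-seen order.
    return counts.most_common(n)


def format_counts(pairs):
    """Format pairs with the same "%7d %s" layout used by the R script."""
    return "".join("%7d %s\n" % (count, word) for word, count in pairs)


# Small made-up sample, used only to illustrate the output layout.
sample = ["The cat and the hat.", "The cat sat; the cat ran!"]
print(format_counts(top_words(sample, 2)), end="")
```

On the made-up sample, "the" occurs four times and "cat" three times, so those are the two lines printed.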

Let's check that all three implementations (i.e., Bash, Python, and R) return the same top 5 words with the same counts:

$ < data/76.txt ./top-words-5.sh 5
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
$ < data/76.txt ./top-words.py 5
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
$ < data/76.txt ./top-words.R 5
   6441 and
   5082 the
   3666 i
   3258 a
   3022 to
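The comparison above is done by eye. It could also be automated; the following is a minimal sketch, assuming the three scripts are executable and live in the current directory. The helper name outputs_agree is hypothetical, and the usage lines are commented out because they depend on those files existing.

```python
import subprocess


def outputs_agree(commands, input_bytes):
    """Run each command with the same stdin; return True if every stdout is identical."""
    outputs = [
        # check=True raises CalledProcessError if a command exits with a nonzero status.
        subprocess.run(cmd, input=input_bytes, stdout=subprocess.PIPE, check=True).stdout
        for cmd in commands
    ]
    return all(out == outputs[0] for out in outputs)


# Hypothetical usage with the scripts and data file from this chapter:
# with open("data/76.txt", "rb") as f:
#     text = f.read()
# scripts = ["./top-words-5.sh", "./top-words.py", "./top-words.R"]
# print(outputs_agree([[script, "5"] for script in scripts], text))
```

The comparison is byte for byte, so it would also flag differences in whitespace or in the order of tied words.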

Wonderful! Sure, the output itself is not very exciting. What is exciting is the observation that we can accomplish the same task with multiple approaches. Let's have a look at the differences between the approaches.

First, what's immediately obvious is the difference in the amount of code. For this specific task, both Python and R require much more code than Bash. This illustrates that, for some tasks, it can be more efficient to use the command line. For other tasks, you
