Example 4-5 uses pure Python. When you want to do advanced text
processing, we recommend you check out the NLTK package (Perkins,
2010). If you are going to work with a lot of numerical data,
then we recommend you use the Pandas package (McKinney, 2012).
And in R, the code would look something like Example 4-6 (thanks to Hadley
Wickham):
Example 4-6. ~/book/ch04/top-words.R
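#!/usr/bin/env Rscript
# Number of top words to print, taken from the command-line argument
n <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
lines <- readLines(f)
# Split each line on runs of non-word characters, then lowercase the tokens
words <- tolower(unlist(strsplit(lines, "\\W+")))
# Count each distinct word and order the counts from most to least frequent
counts <- sort(table(words), decreasing = TRUE)
counts_n <- counts[1:n]
cat(sprintf("%7d %s\n", counts_n, names(counts_n)), sep = "")
close(f)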
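To make the steps of the R script explicit (tokenize, lowercase, count, sort, print the top n), here is a minimal pure-Python sketch of the same idea. It is not the book's Example 4-5, which falls outside this section. The function names `top_words` and `format_counts` are hypothetical, and the tokenizer matches `\w+` rather than splitting on `\W+`, so edge cases such as empty tokens or ties between equally frequent words may come out differently than in the R version.

```python
import re
from collections import Counter


def top_words(lines, n):
    """Return the n most frequent lowercased words as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # \w+ matches runs of word characters, so punctuation and
        # whitespace act as separators and empty tokens never appear.
        counts.update(re.findall(r"\w+", line.lower()))
    # most_common sorts by count, highest first.
    return counts.most_common(n)


def format_counts(pairs):
    """Format each pair as a right-aligned count, a space, and the word."""
    return "".join("%7d %s\n" % (count, word) for word, count in pairs)


# Small in-memory sample (hypothetical data, not taken from data/76.txt).
sample = ["The cat and the dog.", "And the bird!"]
print(format_counts(top_words(sample, 2)), end="")
```

To use the sketch as a command-line filter like the scripts in this chapter, one could pass `sys.stdin` as `lines` and read `n` from `sys.argv[1]`.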
Let's check that all three implementations (i.e., Bash, Python, and R) return the same
top 5 words with the same counts:
$ < data/76.txt ./top-words-5.sh 5
6441 and
5082 the
3666 i
3258 a
3022 to
$ < data/76.txt ./top-words.py 5
6441 and
5082 the
3666 i
3258 a
3022 to
$ < data/76.txt ./top-words.R 5
6441 and
5082 the
3666 i
3258 a
3022 to
Wonderful! Sure, the output itself is not very exciting. What is exciting is the
observation that we can accomplish the same task with multiple approaches. Let's have a look
at the differences between the approaches.
First, what's immediately obvious is the difference in the amount of code. For this specific
task, both Python and R require much more code than Bash. This illustrates that, for
some tasks, it can be more efficient to use the command line. For other tasks, you