Example 4-5 uses pure Python. When you want to do advanced text
processing, we recommend you check out the NLTK package (Perkins, 2010).
If you are going to work with a lot of numerical data, then we recommend
you use the Pandas package (McKinney, 2012).
And in R, the code would look something like Example 4-6 (thanks to
Hadley Wickham):
Example 4-6. ~/book/ch04/top-words.R
#!/usr/bin/env Rscript
n <- as.integer(commandArgs(trailingOnly = TRUE))
f <- file("stdin")
lines <- readLines(f)
words <- tolower(unlist(strsplit(lines, "\\W+")))
counts <- sort(table(words), decreasing = TRUE)
counts_n <- counts[1:n]
cat(sprintf("%7d %s\n", counts_n, names(counts_n)), sep = "")
close(f)
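To make the steps of the R script explicit (split on non-word characters, lowercase, count, sort, keep the top n, print in fixed-width columns), here is a rough Python rendering of the same pipeline. It is an illustrative sketch, not the book's Example 4-5: the function names top_words and format_counts are made up for this illustration. Unlike the R script, it drops empty tokens, and ties between equally frequent words may be ordered differently.

```python
import re
from collections import Counter


def top_words(lines, n):
    """Return the n most frequent lowercase words as (word, count) pairs.

    Hypothetical helper for illustration; not part of the book's code.
    """
    words = []
    for line in lines:
        # Split on runs of non-word characters, as strsplit(lines, "\\W+") does.
        # Assumption: empty tokens are dropped here, which the R script does not do.
        words.extend(w for w in re.split(r"\W+", line.lower()) if w)
    # most_common(n) sorts by descending count and keeps the first n entries.
    return Counter(words).most_common(n)


def format_counts(pairs):
    """Format pairs like sprintf("%7d %s\\n", ...): count right-aligned in 7 columns."""
    return "".join(f"{count:7d} {word}\n" for word, count in pairs)
```

With standard input as the source of lines, something like format_counts(top_words(sys.stdin, 5)) would produce output in the same layout as the R script.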
Let's check that all three implementations (i.e., Bash, Python, and R) return the same
top 5 words with the same counts:
$ < data/76.txt ./top-words-5.sh 5
6441 and
5082 the
3666 i
3258 a
3022 to
$ < data/76.txt ./top-words.py 5
6441 and
5082 the
3666 i
3258 a
3022 to
$ < data/76.txt ./top-words.R 5
6441 and
5082 the
3666 i
3258 a
3022 to
Wonderful! Sure, the output itself is not very exciting. What is exciting is the
observation that we can accomplish the same task with multiple approaches. Let's
have a look at the differences between the approaches.
First, what's immediately obvious is the difference in amount of code. For this specific
task, both Python and R require much more code than Bash. This illustrates that, for
some tasks, it can be more efficient to use the command line. For other tasks, you