still a good idea to use head() rather than print() to display the results.
If there are more than a few dozen results, it won't be particularly useful to
list them all to the screen, so head() will just show the first few rows.
> dim(results)
[1] 42 3
> head(results)
          corpus date    c
1        various    0 1349
2        sonnets    0 3677
3   1kinghenryvi 1590 4441
4   3kinghenryvi 1590 4076
5   2kinghenryvi 1590 4683
6 kingrichardiii 1592 4713
R Example: Predicting Shakespeare
Now that you've seen how to use bigrquery to run BigQuery queries in R,
let's try a more interesting example: see if we can predict, based on the
words in a Shakespeare play, whether the play is a comedy, a history, or a
tragedy. This type of classification is an example of something that is easy to
do in R but cannot be done directly from SQL.
We use a naïve Bayesian classifier to classify the plays. Although this might
sound, well, naïve, this is a powerful prediction mechanism. For example,
naïve Bayes is the basis for many spam filters. If you think about it,
predicting whether a play is a comedy, history, or tragedy from word usage
is similar to predicting whether an e-mail is spam. Perhaps in the 17th
century, people worried about “unsolicited histories” that they'd have to sit
through when what they actually wanted was a light comedy. In that case,
our classifier would have been able to tell them whether they should stay
home instead.
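To make the idea concrete, here is a minimal naïve Bayes sketch in base R. It is only an illustration under stated assumptions: the toy word counts, the two labels, and the helper names train_nb and predict_nb are all made up, and add-one (Laplace) smoothing is one common choice rather than something this example prescribes.

```r
# Toy word-count matrix: rows are documents, columns are words.
# All values are invented for illustration.
counts <- matrix(
  c(5, 0, 1,
    4, 1, 0,
    0, 6, 1,
    1, 5, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(NULL, c("love", "crown", "death"))
)
labels <- factor(c("comedy", "comedy", "history", "history"))

# Train: log prior per class plus add-one smoothed log word likelihoods.
# (train_nb is a hypothetical helper, not part of any package.)
train_nb <- function(counts, labels) {
  classes <- levels(labels)
  log_prior <- log(table(labels)[classes] / length(labels))
  log_lik <- t(sapply(classes, function(cl) {
    totals <- colSums(counts[labels == cl, , drop = FALSE])
    log((totals + 1) / (sum(totals) + ncol(counts)))
  }))
  list(classes = classes, log_prior = log_prior, log_lik = log_lik)
}

# Predict: choose the class with the highest log posterior score.
# (predict_nb is likewise a hypothetical helper.)
predict_nb <- function(model, new_counts) {
  scores <- model$log_lik %*% new_counts + as.numeric(model$log_prior)
  model$classes[which.max(scores)]
}

model <- train_nb(counts, labels)
prediction <- predict_nb(model, c(love = 3, crown = 0, death = 1))
print(prediction)
```

The "naïve" part is the assumption that each word contributes independently to the score, which is why the prediction reduces to summing per-word log likelihoods weighted by the word counts.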
To start, find a filtered list of all the words used in Shakespeare plays.
Exclude the words used in every play because they don't provide any
predictive power. Exclude, also, the words that are used only in a single play
because they could lead to overfitting the data. Here's the query that gets all
the words in Shakespeare that show up in more than 1 and fewer than 35
plays:
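The same kind of filter can also be sketched locally in base R, which may help make the thresholds concrete. This is not the BigQuery query itself: the usage data frame, its word and corpus columns, and the small thresholds are assumptions chosen for a three-play toy data set.

```r
# Toy (word, corpus) pairs; all values are invented for illustration.
usage <- data.frame(
  word   = c("the", "the", "the", "crown", "crown", "jest"),
  corpus = c("hamlet", "macbeth", "othello", "hamlet", "macbeth", "othello"),
  stringsAsFactors = FALSE
)

# Count the number of distinct corpora each word appears in.
corpus_counts <- tapply(usage$corpus, usage$word,
                        function(x) length(unique(x)))

# Keep words that appear in more than min_corpora and fewer than
# max_corpora. With three toy plays, max_corpora = 3 drops words used in
# every play, and min_corpora = 1 drops words used in only one play.
min_corpora <- 1
max_corpora <- 3
kept_words <- names(corpus_counts)[corpus_counts > min_corpora &
                                   corpus_counts < max_corpora]
print(kept_words)
```

In this toy data, "the" is dropped because it appears in every play and "jest" is dropped because it appears in only one, which mirrors the two exclusions described above.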