Using BigQuery from Third-Party Tools - Google BigQuery Analytics

still a good idea to use head() rather than print() to display the results.

If there are more than a few dozen results, it won't be particularly useful to

list them all to the screen, so head() will just show the first few rows.

> dim(results)
[1] 42  3
> head(results)
          corpus date    c
1        various    0 1349
2        sonnets    0 3677
3   1kinghenryvi 1590 4441
4   3kinghenryvi 1590 4076
5   2kinghenryvi 1590 4683
6 kingrichardiii 1592 4713

R Example: Predicting Shakespeare

Now that you've seen how to use bigrquery to run BigQuery queries in R,

let's try a more interesting example: See if we can predict, based on the

words in a Shakespeare play, whether the play is a comedy, a history, or a

tragedy. This type of classification is an example of something that is easy to

do in R but cannot be done directly from SQL.

We use a naïve Bayesian classifier to classify the plays. Although this might

sound, well, naïve, this is a powerful prediction mechanism. For example,

naïve Bayes is the basis for most spam filters. If you think about it,

predicting whether a play is a comedy, history, or tragedy from word usage

is similar to predicting whether an e-mail is spam. Perhaps in the 17th

century, people worried about “unsolicited histories” that they'd have to sit

through when what they actually wanted was a light comedy. In that case,

our classifier would have been able to tell them whether they should stay

home instead.
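To make the idea concrete, here is a minimal sketch of a multinomial naïve Bayes classifier in base R. It is only an illustration under stated assumptions: the word counts, play names, and genre labels are invented toy data, and the helper names train_nb and predict_nb are hypothetical, not part of bigrquery or any other package.

```r
# Toy word counts: rows are plays, columns are words (all values invented).
counts <- matrix(
  c(8, 1, 0,
    7, 2, 1,
    1, 9, 2,
    0, 8, 3,
    1, 2, 9,
    2, 1, 8),
  ncol = 3, byrow = TRUE,
  dimnames = list(paste0("play", 1:6), c("love", "crown", "death")))
genres <- c("comedy", "comedy", "history", "history", "tragedy", "tragedy")

# Hypothetical helper: estimate log priors and smoothed log word probabilities.
train_nb <- function(counts, genres) {
  class_totals <- rowsum(counts, group = genres)  # genres x words
  smoothed <- class_totals + 1                    # add-one (Laplace) smoothing
  prior <- table(genres) / length(genres)
  list(
    log_prior = setNames(log(as.vector(prior)), names(prior)),
    log_lik = log(smoothed / rowSums(smoothed)))
}

# Hypothetical helper: pick the genre with the highest log score.
predict_nb <- function(model, new_counts) {
  # Score = log prior + sum over words of count * log P(word | genre).
  scores <- model$log_prior + as.vector(model$log_lik %*% new_counts)
  names(scores)[which.max(scores)]
}

model <- train_nb(counts, genres)
predict_nb(model, c(love = 6, crown = 1, death = 1))
```

The classifier is "naïve" because it treats every word as independent of the others given the genre, which is why the score is a simple sum of per-word log probabilities. The add-one smoothing keeps a word that never appears in one genre from driving that genre's probability to zero.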

To start, find a filtered list of all the words used in Shakespeare plays.

Exclude the words used in every play because they don't provide any

predictive power. Exclude, also, the words that are used only in a single play

because they could lead to overfitting the data. Here's the query that gets all

the words in Shakespeare that show up in more than 1 and fewer than 35

plays:
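As a rough local illustration of the same filtering rule, the sketch below applies it to a small data frame in base R. It assumes a hypothetical long-format table with one row per (word, play) pair. The data, the column names, and the helper filter_words are invented for the example and do not reproduce the BigQuery query.

```r
# Hypothetical long-format data: one row per (word, play) pair.
word_plays <- data.frame(
  word = c("the", "the", "the", "crown", "crown", "zounds"),
  corpus = c("hamlet", "othello", "macbeth", "hamlet", "macbeth", "othello"),
  stringsAsFactors = FALSE)

# Hypothetical helper: keep words used in more than min_plays
# and fewer than max_plays distinct plays.
filter_words <- function(word_plays, min_plays = 1, max_plays = 35) {
  # Count the distinct plays in which each word appears.
  plays_per_word <- tapply(word_plays$corpus, word_plays$word,
                           function(p) length(unique(p)))
  keep <- plays_per_word > min_plays & plays_per_word < max_plays
  names(plays_per_word)[keep]
}

# With only three toy plays, use 3 as the stand-in for "every play".
filter_words(word_plays, min_plays = 1, max_plays = 3)
```

In the toy data, "the" appears in all three plays and "zounds" in only one, so both are dropped and only "crown" survives the filter.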
