ORDER BY rel_freq DESC
LIMIT 100
WARNING
This query processes 54 GB of data, which would cost $0.27 at current BigQuery prices. That might not break the bank, but you might want to think twice before running it too often.
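The $0.27 figure follows from BigQuery's on-demand pricing, assumed here to be $5 per TB processed (pricing changes over time, so treat this as an illustration of the arithmetic, not a current quote):

```python
# Assumed on-demand price of $5 per TB processed (illustrative only).
bytes_processed_gb = 54
price_per_tb = 5.00

# Convert GB to TB, then multiply by the per-TB price.
cost = bytes_processed_gb / 1000 * price_per_tb
print(f"${cost:.2f}")  # $0.27
```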
This query looks a bit intimidating: five SELECT statements, one JOIN, three GROUP BYs, and a lot of parentheses. But if you look at it carefully, you've already seen just about everything used in this query, with a single exception: we combine two SELECT statements with a comma. That is addressed in a minute, but first consider what this query does and why you'd want to use it.
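As a hedged preview of that exception: in BigQuery's legacy SQL, a comma between two tables or subqueries concatenates their rows (it behaves like a UNION ALL, not a join). The table names in this sketch are hypothetical stand-ins, not the ones from the actual query:

```sql
-- Sketch only: the comma below concatenates the two result sets
-- (legacy BigQuery SQL treats it as UNION ALL, not a join).
SELECT word, word_count FROM
  (SELECT word, word_count FROM [publicdata:samples.shakespeare]),
  (SELECT word, word_count FROM [hypothetical_dataset.word_frequency])
```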
If you look at the actual numbers returned from query #4, you'll notice something surprising: the ratio of the number of times “villainy” appears in Shakespeare to the total number of times it appears in a wider English-language corpus is 49:1. That is 49 times more in total, not 49 times more frequently. This means that “villainy” probably appeared only once or twice in the word_frequency corpus but appeared far more often in Shakespeare. Clearly, the Shakespeare texts weren't part of the corpus used.
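To see why a 49:1 ratio of raw counts says nothing about relative frequency, compare the two measures on made-up counts (all numbers here are illustrative, not values from the real tables):

```python
# Illustrative counts, not the real table values.
shakes_count, shakes_total = 49, 945_000        # "villainy" vs. all Shakespeare words
corpus_count, corpus_total = 1, 560_000_000     # "villainy" vs. all corpus words

# Ratio of raw counts: 49 occurrences vs. 1 occurrence.
count_ratio = shakes_count / corpus_count

# Ratio of relative frequencies: each count is first normalized
# by the size of its own corpus.
rel_freq_ratio = (shakes_count / shakes_total) / (corpus_count / corpus_total)

print(count_ratio)     # 49.0
print(rel_freq_ratio)  # vastly larger once corpus sizes are accounted for
```

The tiny corpus count is exactly the symptom described above: the wider corpus simply didn't include the Shakespeare texts.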
This problem can be fixed by adding another data source that contains
the Shakespeare texts. The trigrams table in the BigQuery public sample
dataset contains every three-word combination in the English language that
was found during the Google Books book-scanning project. There are 68
billion rows in the table, representing 192 billion scanned words. BigQuery
is supposed to work on Big Data, but so far we've mostly used toy tables; this
query should test BigQuery's handling of a real Big Data table.
The first step, as you can see in the added inner query, is rolling up the trigrams to individual words. The table has three fields representing the words in the trigram: first, second, and third. We're going to consider
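That roll-up can be mimicked in Python on a few made-up trigram rows: each row contributes its first, second, and third word to a single per-word count, the same effect the inner query achieves over the three columns (the rows here are illustrative, not data from the trigrams table):

```python
from collections import Counter

# Made-up trigram rows with (first, second, third) word fields.
trigrams = [
    ("to", "be", "or"),
    ("be", "or", "not"),
    ("or", "not", "to"),
]

# Roll the three word columns up into one per-word count.
counts = Counter(word for row in trigrams for word in row)
print(counts["or"])  # 3: "or" appears once in each of the three rows
```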