fields, which are not fields on the original table; they were computed by
aggregating row values.
Query #3: Joins
SELECT shakespeare.word AS word,
  SUM(shakespeare.word_count / english.count) AS rel_freq
FROM [publicdata:samples.shakespeare] AS shakespeare
JOIN [bigquery-e2e:reference.word_frequency] AS english
ON shakespeare.word = english.word
GROUP BY word
ORDER BY rel_freq DESC
LIMIT 10
For the third query, we combine the Shakespeare data with another data
set, one that tells you the overall frequency of words in the English language.
This other data set contains two fields: word, which is the word in question,
and count, which is the total number of times the word appeared when
scanning a large corpus of English-language documents. This data came
from a freely available word-frequency list; we have downloaded the list and
saved it in a table named bigquery-e2e:reference.word_frequency.
Armed with a list of overall English word frequency, you can now see how
the word frequencies in Shakespeare compare to overall English-language
usage. Query #3 computes the relative frequency between Shakespeare's
usage and English-language usage.
We no longer have to filter based on word length because (we assume)
the high-frequency words in the Shakespeare corpus like “the” will also be
high-frequency in the English-language corpus and won't show up in the top
of the results. If they do, that might be someone's doctoral thesis waiting
to happen: “An analysis of the preponderance of definite article usage in
Shakespeare. What does usage of the word 'the' in the Bard's early plays tell
us about socio-economic conflict in Elizabethan England?”
The way to combine multiple data sets in this way in SQL is to use a
JOIN operation. A JOIN takes two tables and matches every row in the first table