Database Reference
In-Depth Information
sentence, it is in lowercase. We'd like to correct for this, so “Falchion” and
“falchion” show up as the same word. To do this, we can just convert the
word to lowercase. Unfortunately, we need to do this conversion before the
JOIN operation takes place, which means we need to use a subselect.
We change our original FROM clause into a nested select statement that
computes the lowercase word. We make one other change: We filter out
words that are all caps. In our Shakespeare dataset, names of characters
display in all caps, such as “HAMLET.” Because names of characters are
going to show up in the plays more often than in a wider dataset, to prevent
these from distorting the results, we filter them out. We also filter out
anything with an apostrophe—Shakespeare made up a lot of contractions
that don't exist in modern spelling—”wrong'd,” for example—and these
aren't particularly interesting to report.
As a bonus, we've also applied the same lowercase conversion to the
word_frequency corpus. In this case, it wasn't necessary, but we've added
it here to show that you can also perform subselects in the JOIN clause too.
We've also divided the counts by the number of rows in each table, so that
neither table is weighted more heavily than the other.
Query #4 starts to look complicated. But if you compare it to query #3, it
isn't that different. This is a common pattern to use in BigQuery, maybe
more so than in other dialects of SQL. You start with a simple SQL query,
and start nesting subselects when you need to convert or aggregate values.
It can make it harder to read the query, but can allow you get a lot done in a
single query without creating temporary tables.
For those playing along at home, you'll notice that “villainy” is still in the
#1 spot, but some other interesting words show up near the top: “trencher,”
“falchion,” “tapster,” “fitly,” and “spritely”—these all sound like good names
for startups.
+----------------+--------------------+
| word | rel_freq |
+----------------+--------------------+
| villany | 6292.536177703534 |
| amain | 1926.2865850112858 |
| severally | 1797.867479343867 |
| trencher | 1412.6101623416098 |
| beseeming | 1284.1910566741906 |
Search WWH ::




Custom Search