Database Reference
In-Depth Information
only the first element in the trigram, and sum up how many times it appears
throughout the corpus. The subquery follows:
SELECT LOWER(first) AS word, sum(cell.match_count)/
1583 AS freq
FROM [publicdata:samples.trigrams]
GROUP BY word)
This should look very familiar, although the appearance of the constant 1583
may be surprising. This number is the ratio of the number of total elements
in the trigrams corpus to the number of elements in the word_count
corpus. We scale the frequencies this way so that the trigrams dataset
doesn't dwarf the importance of the word_count corpus; we'd like to treat
them as equals, despite one having vastly more data.
The next step is combining the two tables, word_count and trigrams .
Query #5 shows how to combine these tables using a table union. The syntax
for a table union is simple: concatenate two or more tables together with a
comma (for example, FROM table1, table2, . . . ).
If you're a SQL guru, you're probably more than a little bit surprised by
this—there is already an accepted standard interpretation of tables
separated by commas—an implicit JOIN . BigQuery, perhaps unfortunately,
changes this convention to mean UNION ALL . This nonstandard behavior is
discussed in the next section.
In this query, we want to combine the data from the two tables but they have
different column names. To get the column names to match up, we use a
subselect to lowercase the words and change the column names. Then we
group by word and sum the word counts to get a single effective table that
has the combined frequency data.
There is another change in this query, related to a potential problem with
numerical precision when summing the ratios. To fix this, we changed the
first subselect to compute the total word frequency across all plays, so there
will be a single row per word. We moved the GROUP BY inside that
sub-select, so the ratio computation is done only once per word and not as
an aggregation.
If you look at these results, you'll find that Shakespeare uses some strange
words, indeed:
Search WWH ::




Custom Search