Advanced Queries - Google BigQuery Analytics

Database Reference

In-Depth Information

JOIN Key CARTESIAN-Product Explosion

One aspect of SQL JOIN s that often confuses people who are relatively

new to using them is what happens when there are multiple matches on

the left and the right. For instance, the following query joins the GitHub

and Shakespeare tables where the Shakespeare corpus name matches

the GitHub repository name and the Shakespeare corpus is The

Tempest :

SELECT shakespeare.corpus

FROM [publicdata:samples.github_timeline] github

JOIN [publicdata:samples.shakespeare] shakespeare

ON github.repository_name = shakespeare.corpus

WHERE shakespeare.corpus = "tempest"

There are 146 rows in the GitHub table where the repository name is

“tempest” and 3636 rows in the Shakespeare table where the corpus

name is the same. So when we join these two tables by matching the

repository name with the corpus, we get 146 * 3636 which is more than

one-half of a million rows.

It is easy to see that if you're not careful, when you run a JOIN

operation, you can get unintended explosions in the results. Imagine

joining a table of customers with a table of orders on the

customer_id . If both tables have a significant number of frequently

occurring values, the join could end up producing billions of results by

matching the commonly occurring value in one table with each instance

of the value in the other table. If you run a JOIN operation and

BigQuery tells you the results are too large to return, JOIN key

explosion may be the culprit.

Analytic and Windowing Functions

There are some data transformations that are awkward to express in SQL.

A good example is computing the rank of a record within a given group of

records. The basic problem is that the tuple relational calculus, the formal

framework underlying SQL, does not actually support ordered sequences

Search WWH ::

Custom Search

Home