Database Reference
In-Depth Information
contributor_id,
(LOWER(title) = 'george w. bush') AS bush_edit,
(LOWER(title) = 'barrack obama') AS obama_edit
FROM [publicdata:samples.wikipedia])
GROUP EACH BY 1
HAVING bush_all OR bush_some OR obama_all OR
obama_some)
GROUP BY 1, 2, 3, 4
ORDER BY 1, 2, 3, 4
The innermost query could have been collapsed into the containing query,
but it is often more readable if you have a query that solely computes
the conditions on individual records, especially if a condition is used more
than once in the containing query as is the case here. The
GROUP EACH
BY
nested within a
GROUP BY
query is typical for cohort analysis. The
EACH
query is collecting and summarizing all the records for every user.
It is common to need
EACH
because there are usually a large number of
entities, so shuffling the data among nodes avoids hitting resource limits.
This inner query produces one record per user with fields that describe the
properties of interest. The outer query then counts the number of users
falling in each cohort, where the cohort is defined by specific values for the
properties computed. In our example, the properties are “user has edited
at
least
one Bush page,” “user has
only
edited Obama pages,” and so on. This
grouping does not have an
EACH
because the number of different cohorts
tends to be small for any given query, in our example, just 5. You can
accumulate multiple statistics in one pass; for example, this query computes
the number of unique users in each cohort and the average number of edits
per contributor in a cohort.
To illustrate more complex cohort conditions that can be implemented using
the same basic strategy, imagine defining a cohort based on the relative
times of the first edit.
SELECT
IF(bush_some AND obama_some
,
first_bush < first_obama, NULL) AS bush_earlier
,
bush_some,
obama_some,
COUNT(1) AS num,