Database Reference
In-Depth Information
contributor_id,
(LOWER(title) = 'george w. bush') AS bush_edit,
(LOWER(title) = 'barrack obama') AS obama_edit
FROM [publicdata:samples.wikipedia])
GROUP EACH BY 1
HAVING bush_all OR bush_some OR obama_all OR
obama_some)
GROUP BY 1, 2, 3, 4
ORDER BY 1, 2, 3, 4
The innermost query could have been collapsed into the containing query,
but it is often more readable if you have a query that solely computes
the conditions on individual records, especially if a condition is used more
than once in the containing query as is the case here. The GROUP EACH
BY nested within a GROUP BY query is typical for cohort analysis. The
EACH query is collecting and summarizing all the records for every user.
It is common to need EACH because there are usually a large number of
entities, so shuffling the data among nodes avoids hitting resource limits.
This inner query produces one record per user with fields that describe the
properties of interest. The outer query then counts the number of users
falling in each cohort, where the cohort is defined by specific values for the
properties computed. In our example, the properties are “user has edited at
least one Bush page,” “user has only edited Obama pages,” and so on. This
grouping does not have an EACH because the number of different cohorts
tends to be small for any given query, in our example, just 5. You can
accumulate multiple statistics in one pass; for example, this query computes
the number of unique users in each cohort and the average number of edits
per contributor in a cohort.
To illustrate more complex cohort conditions that can be implemented using
the same basic strategy, imagine defining a cohort based on the relative
times of the first edit.
SELECT
IF(bush_some AND obama_some ,
first_bush < first_obama, NULL) AS bush_earlier ,
bush_some,
obama_some,
COUNT(1) AS num,
Search WWH ::




Custom Search