Advanced Queries - Google BigQuery Analytics

Database Reference

In-Depth Information

contributor_id,

(LOWER(title) = 'george w. bush') AS bush_edit,

(LOWER(title) = 'barrack obama') AS obama_edit

FROM [publicdata:samples.wikipedia])

GROUP EACH BY 1

HAVING bush_all OR bush_some OR obama_all OR

obama_some)

GROUP BY 1, 2, 3, 4

ORDER BY 1, 2, 3, 4

The innermost query could have been collapsed into the containing query,

but it is often more readable if you have a query that solely computes

the conditions on individual records, especially if a condition is used more

than once in the containing query as is the case here. The GROUP EACH

BY nested within a GROUP BY query is typical for cohort analysis. The

EACH query is collecting and summarizing all the records for every user.

It is common to need EACH because there are usually a large number of

entities, so shuffling the data among nodes avoids hitting resource limits.

This inner query produces one record per user with fields that describe the

properties of interest. The outer query then counts the number of users

falling in each cohort, where the cohort is defined by specific values for the

properties computed. In our example, the properties are “user has edited at

least one Bush page,” “user has only edited Obama pages,” and so on. This

grouping does not have an EACH because the number of different cohorts

tends to be small for any given query, in our example, just 5. You can

accumulate multiple statistics in one pass; for example, this query computes

the number of unique users in each cohort and the average number of edits

per contributor in a cohort.

To illustrate more complex cohort conditions that can be implemented using

the same basic strategy, imagine defining a cohort based on the relative

times of the first edit.

SELECT

IF(bush_some AND obama_some ,

first_bush < first_obama, NULL) AS bush_earlier ,

bush_some,

obama_some,

COUNT(1) AS num,

Search WWH ::

Custom Search

Home