Database Reference
In-Depth Information
Exact Count Distinct
The COUNT(DISTINCT field , N ) aggregation function is approximate
and intended to be a quick way to roughly determine the cardinality of a
given field. It is exact for cardinalities less than N , so it is tempting to simply
use it with a large value of N to determine the exact cardinality. This is
problematic because it consumes memory proportional to N causing these
kinds of queries to fail. There is a straightforward way to determine the exact
cardinality of a field, and it also scales to larger cardinalities. The query
is slower because it involves a shuffle operation, so rather than using the
approach outlined next all the time, reserve it for when you have to compute
exact cardinalities for large sets.
Here is the simplest version of the query that highlights how the shuffle is
used to collapse duplicate values in a given field.
SELECT COUNT(1) AS unique FROM (
SELECT 1 FROM [publicdata:samples.shakespeare]
GROUP EACH BY word)
The inner query buckets each input row into a group associated with a
distinct word. The SELECT 1 is perhaps a little confusing. The query would
remain the same if you change it to SELECT word, but you can use a
constant expression here to emphasize that you do not care about the actual
value of the word outside the grouping clause. COUNT(1) similarly indicates
that we are interested in the number of rows rather than any particular field.
This approach scales to millions of distinct values because EACH distributes
the aggregation operation.
It is fairly common to want to determine distinct values broken down across
one or more dimensions. Daily active users over the last seven days or
distinct HTTP user agents by URL are examples of such queries. All you
need to do is add an additional field to the grouping clause and thread it
through to the outer query.
SELECT wp_namespace,
COUNT(num) AS unique,
SUM(num) AS total
FROM (
SELECT wp_namespace,
Search WWH ::




Custom Search