Database Reference
In-Depth Information
BigQuery doesn't provide an exact value by default because the exact value
is surprisingly difficult to calculate for a large table. To know whether you've
seen a value before, you have to keep track of all values that you've seen; this
is expensive and difficult to do unless you first sort the data.
If you're not content with the approximate value, you have two options:
Supply a bucket count as an optional argument to COUNT(DISTINCT
field, [buckets]) , or rewrite your query to perform a GROUP EACH
BY and count the results.
For example, say your original query is:
SELECT COUNT(DISTINCT word)
FROM [publicdata:samples.shakespeare]
The exact value is 32,786, but the value returned by the preceding query is
31,719. One option for rewriting the query is just to specify a bucket count
larger than the expected value. There are only approximately 50,000 words
in the English language, so a bucket count of 50,000 should be reasonable.
SELECT COUNT(DISTINCT word, 50000)
FROM [publicdata:samples.shakespeare]
This gives the exact value. Specifying a higher bucket count has limitations;
however, each bucket takes up memory, and too many buckets can cause a
Resources Exceeded error. If you want to remove any uncertainty, you can
perform a GROUP BY operation:
SELECT COUNT(*)
FROM (
SELECT word
FROM [publicdata:samples.shakespeare]
GROUP BY word)
The downside to the GROUP BY approach is that it may make it more
difficult to compute multiple values in a single query; that is, if you were also
trying to compute the sum of the word_count field as well, you'd have to
make sure that count would propagate through the GROUP BY , as in:
Search WWH ::




Custom Search