Running Queries - Google BigQuery Analytics

Database Reference

In-Depth Information

BigQuery doesn't provide an exact value by default because the exact value

is surprisingly difficult to calculate for a large table. To know whether you've

seen a value before, you have to keep track of all values that you've seen; this

is expensive and difficult to do unless you first sort the data.

If you're not content with the approximate value, you have two options:

Supply a bucket count as an optional argument to COUNT(DISTINCT

field, [buckets]) , or rewrite your query to perform a GROUP EACH

BY and count the results.

For example, say your original query is:

SELECT COUNT(DISTINCT word)

FROM [publicdata:samples.shakespeare]

The exact value is 32,786, but the value returned by the preceding query is

31,719. One option for rewriting the query is just to specify a bucket count

larger than the expected value. There are only approximately 50,000 words

in the English language, so a bucket count of 50,000 should be reasonable.

SELECT COUNT(DISTINCT word, 50000)

FROM [publicdata:samples.shakespeare]

This gives the exact value. Specifying a higher bucket count has limitations;

however, each bucket takes up memory, and too many buckets can cause a

Resources Exceeded error. If you want to remove any uncertainty, you can

perform a GROUP BY operation:

SELECT COUNT(*)

FROM (

SELECT word

FROM [publicdata:samples.shakespeare]

GROUP BY word)

The downside to the GROUP BY approach is that it may make it more

difficult to compute multiple values in a single query; that is, if you were also

trying to compute the sum of the word_count field as well, you'd have to

make sure that count would propagate through the GROUP BY , as in:

Search WWH ::

Custom Search

Home