Running Queries - Google BigQuery Analytics

Database Reference

In-Depth Information

You may be wondering why this behavior was chosen. When Dremel (the

query engine behind BigQuery) was first developed, JOIN operations were

not supported. However, there were a lot of users who had tables broken

up by day; they often wanted to query across several of them. To make the

syntax for querying across multiple tables as simple as possible, the Dremel

SQL designers used a comma-separated list of tables to concatenate tables.

This syntax wasn't ambiguous because JOIN didn't exist. Now that JOIN is

supported, however, the syntax creates confusion. However, because there

are now a lot of users who rely on the comma-separated list to mean UNION

ALL , it would be difficult to change.

JOIN EACH and GROUP EACH

If you run a query that has a GROUP BY operation where the number of

resulting rows is high, the query may fail with a Resources Exceeded error.

Likewise, if you try to do a JOIN operation on two large tables, you may get

the same error. In both of these cases, the EACH keyword can come to the

rescue.

EACH is a hint to the BigQuery query optimizer that instructs it to perform a

shuffle operation . Shuffle is described in detail in Chapter 9, but for now it

can be described as sorting the data to process more of the data in parallel.

At some time in the future, BigQuery may be smart enough to infer the table

size so that you won't need to use a qualifier, but as of this writing, the EACH

keyword is required when performing “big” JOIN s or certain GROUP BY s.

It isn't recommended to just add EACH to all your queries; for many types

of queries ( JOIN of one large table against a smaller one, or GROUP BY

with only a few distinct values), using EACH may be significantly slower. A

reasonable rule of thumb is that if the smaller table in the JOIN is more than

1 million rows, or the GROUP BY has more than a million distinct values,

you're probably better off using EACH .

Approximate Values: COUNT DISTINCT

Another BigQuery deviation from standard SQL is the behavior of COUNT

DISTINCT , which returns only an approximate value. This is often

surprising to users who want to get an exact count of the number of distinct

values in a column. The fewer the number of distinct values, the better the

approximation will be; if less than 1000, the number will be exact.

Search WWH ::

Custom Search

Home