Database Reference
In-Depth Information
You may be wondering why this behavior was chosen. When Dremel (the
query engine behind BigQuery) was first developed, JOIN operations were
not supported. However, there were a lot of users who had tables broken
up by day; they often wanted to query across several of them. To make the
syntax for querying across multiple tables as simple as possible, the Dremel
SQL designers used a comma-separated list of tables to concatenate tables.
This syntax wasn't ambiguous because JOIN didn't exist. Now that JOIN is
supported, however, the syntax creates confusion. However, because there
are now a lot of users who rely on the comma-separated list to mean UNION
ALL , it would be difficult to change.
JOIN EACH and GROUP EACH
If you run a query that has a GROUP BY operation where the number of
resulting rows is high, the query may fail with a Resources Exceeded error.
Likewise, if you try to do a JOIN operation on two large tables, you may get
the same error. In both of these cases, the EACH keyword can come to the
rescue.
EACH is a hint to the BigQuery query optimizer that instructs it to perform a
shuffle operation . Shuffle is described in detail in Chapter 9, but for now it
can be described as sorting the data to process more of the data in parallel.
At some time in the future, BigQuery may be smart enough to infer the table
size so that you won't need to use a qualifier, but as of this writing, the EACH
keyword is required when performing “big” JOIN s or certain GROUP BY s.
It isn't recommended to just add EACH to all your queries; for many types
of queries ( JOIN of one large table against a smaller one, or GROUP BY
with only a few distinct values), using EACH may be significantly slower. A
reasonable rule of thumb is that if the smaller table in the JOIN is more than
1 million rows, or the GROUP BY has more than a million distinct values,
you're probably better off using EACH .
Approximate Values: COUNT DISTINCT
Another BigQuery deviation from standard SQL is the behavior of COUNT
DISTINCT , which returns only an approximate value. This is often
surprising to users who want to get an exact count of the number of distinct
values in a column. The fewer the number of distinct values, the better the
approximation will be; if less than 1000, the number will be exact.
Search WWH ::




Custom Search