When to Use Shuffled Queries
GROUP EACH BY and JOIN EACH BY queries are extremely powerful
because they enable you to operate over larger and more complex data
sets than you otherwise would be able to process. It is tempting,
therefore, to use the keyword everywhere. However, there are some
disadvantages to using the EACH keyword. Most obviously, using EACH
adds a lot of extra processing to the query, which can slow
performance significantly. Although it is difficult to estimate how long a
shuffle operation will take, you can count on it adding anywhere from
2 to 60 seconds, depending on table sizes.
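To make the syntax concrete, here is a minimal sketch of both shuffled forms in BigQuery's legacy SQL dialect; the dataset and table names are hypothetical:

-- Shuffled aggregation: EACH tells BigQuery to hash-partition the rows
-- by the grouping key across shards before aggregating.
SELECT customer_id, COUNT(*) AS order_count
FROM [mydataset.orders]
GROUP EACH BY customer_id

-- Shuffled join: both sides are hash-partitioned on the join key,
-- which allows two large tables to be joined.
SELECT o.customer_id, c.name
FROM [mydataset.orders] AS o
JOIN EACH [mydataset.customers] AS c
ON o.customer_id = c.customer_id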
What might not be obvious is that there are other cases in which a
shuffled query is slower: when too many rows hash to the same
shard. Because a single shard must process every row that hashes to a
particular value, a value that is too common can overload that shard
and make the query take much longer to process.
For a simple example, imagine a huge customer table with a Country
field. If 99 percent of your customers are in the United States and you
run a GROUP EACH BY Country, a single shard needs to process 99 percent
of the rows. This can take a long time and can even lead to your
queries being canceled.
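As a minimal sketch of that skewed scenario (again with a hypothetical table name):

-- If 99 percent of rows have Country = 'US', the shard that owns the
-- hash of 'US' must aggregate nearly the whole table by itself.
SELECT Country, COUNT(*) AS customer_count
FROM [mydataset.customers]
GROUP EACH BY Country

A plain GROUP BY Country would likely be faster here, because there are only a few distinct countries and the aggregation fits comfortably in the serving tree without a shuffle.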
It is likely that future versions of BigQuery will be smarter about
picking whether to do a shuffle, so you won't have to worry about it. For
now, however, you should probably default to not using the EACH
keyword unless you need it.
Materialize Queries
When you run a normal query in BigQuery, your results are limited to
less than 128 MB. If you think about how the query runs in the Dremel
serving tree, you can see why this limit exists: everything has to be
returned through the root mixer. If you had 100 GB of results, that would be
a lot of data to return from a single server and a lot of data to aggregate in
one place.
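One way to work around this limit is to materialize the results into a destination table instead of returning them through the root mixer. Here is a rough sketch using the bq command-line tool; the dataset and table names are hypothetical, and the flags assume the legacy SQL dialect:

# Write results to a table rather than returning them to the client,
# which lifts the 128 MB response-size limit.
bq query \
  --destination_table=mydataset.big_results \
  --allow_large_results \
  "SELECT customer_id, order_id FROM [mydataset.orders]"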