Understanding Query Execution - Google BigQuery Analytics


Query Execution

The steps necessary to execute this query are:

1. The mixer receives the query. In this case, the shards don't need to
know about the ORDER BY operation because they'll have to return all
of their data. The mixer strips the ORDER BY clause and the LIMIT, and
sends the query to the shards.

2. Each shard receives the query and starts reading the corpus and
word_count fields. The shard applies the WHERE clause and computes
the aggregations needed for the GROUP BY. After the shard has totaled
all the word_count values for each corpus, it returns those results to
the mixer.

The amount of resources used in this step is proportional to the
cardinality of the corpus field (that is, the number of unique values).
In a large table, there might be billions of distinct values. The shard
needs to keep all of them in memory so that it can perform the
aggregation (the sum of the word_count field for each corpus). If
there are too many values, it will run out of memory and return an error
complaining that resources have been exceeded. There are ways around
this error; see the section entitled “Shuffled Queries” later in this
chapter.

You might think, at first, that each shard could collect corpus names
and word counts and would have to return only its top five to the
mixer. However, this isn't the case; the shard has to return all the
results. Imagine a case in which a certain corpus didn't score highly
when its word count was computed by any individual shard, but after the
results from all the shards were combined, the corpus would end up
with the highest word count. In other words, because the data may be
unevenly distributed, all the results need to be aggregated in the mixer.

3. The mixer receives the results from the shards. It then needs to merge
the aggregates. That is, if shard 1 finds 1,500 words in Hamlet and
shard 2 finds 1,000 words in Hamlet, the mixer needs to compute the
running total (2,500 so far). After all the shards have returned, the
mixer takes the top five values (it needs to sort all the values this
time; it can't use the priority queue trick you saw in the last query,
for the same reason that the shards had to return all of their data)
and returns them to the caller.
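
The three steps above can be sketched in a few lines of Python. This is
only an illustration of the data flow, not BigQuery's implementation:
the function names (shard_aggregate, mixer_merge, top_k_only), the
(corpus, word_count) row shape, the keep predicate standing in for the
WHERE clause, and the sample numbers other than the Hamlet figures are
all assumptions made for the example. It also shows why a shard cannot
return only its own top results.

```python
from collections import Counter
from typing import Callable, Iterable, List, Tuple

Row = Tuple[str, int]  # hypothetical row shape: (corpus, word_count)


def shard_aggregate(rows: Iterable[Row],
                    keep: Callable[[Row], bool] = lambda row: True) -> Counter:
    """Step 2: filter rows, then total word_count per corpus on one shard."""
    totals: Counter = Counter()  # one in-memory entry per distinct corpus
    for row in rows:
        if keep(row):  # stand-in for the WHERE clause (assumed predicate)
            corpus, word_count = row
            totals[corpus] += word_count
    return totals  # every group is returned, not just the largest ones


def mixer_merge(partials: Iterable[Counter], limit: int = 5) -> List[Row]:
    """Step 3: merge the per-shard totals, then sort and apply the limit."""
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)  # Counter.update adds counts together
    # Sort by total descending; break ties by name so output is stable.
    ranked = sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:limit]


def top_k_only(partial: Counter, k: int) -> Counter:
    """The tempting but incorrect shortcut: keep only a shard's top k groups."""
    return Counter(dict(partial.most_common(k)))


# Made-up data: hamlet is never the largest group on any single shard.
SHARD_1 = [("hamlet", 1000), ("hamlet", 500), ("macbeth", 1600)]
SHARD_2 = [("hamlet", 1000), ("othello", 1700)]

partials = [shard_aggregate(SHARD_1), shard_aggregate(SHARD_2)]

correct = mixer_merge(partials, limit=1)
wrong = mixer_merge([top_k_only(p, 1) for p in partials], limit=1)

print(correct)  # [('hamlet', 2500)]
print(wrong)    # [('othello', 1700)]
```

With a limit of 1, merging the full per-shard totals picks hamlet
(1,500 + 1,000 = 2,500), whereas truncating each shard's output first
drops hamlet on both shards and reports othello instead. The per-shard
Counter also makes the memory cost visible: it holds one entry for
every distinct corpus value the shard sees.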
