Database Reference
In-Depth Information
Query Execution
The steps necessary to execute this query are:
1. The mixer receives the query. In this case, the shards don't need to
know about the ORDER BY operation because they'll have to return all
of their data. The mixer strips the ORDER BY clause and the LIMIT , and
sends the query to the shards.
2. The shard receives the query and starts reading the corpus and
word_count fields. The shard applies the WHERE clause and computes
the aggregations needed for the GROUP BY . After the shard has totaled
all the word_count values for each corpus , it returns those results to
the mixer.
The amounts of resources used in this step are proportional to
cardinality of the corpus field—that is, the number of unique values. In
a large table, there might be billions of individual values. The shard
needs to keep all of them in memory so that it can perform the
aggregation (the sum of the word_count field for each corpus ). If
there are too many values, it will run out of memory and return an error
complaining that resources have been exceeded. There are ways around
this error—see the section entitled “Shuffled Queries” later in this
chapter.
You might think, at first, that each shard could collect corpus names
and word counts and would have to return only the top five to the
mixer. However, this isn't the case; the shard has to return all the
results. Imagine a case in which a certain corpus didn't score highly
when its word count was computed by any individual shard, but after all
the shards aggregated their results together, the corpus would end up
with the highest word count. In other words, because the data may be
unevenly distributed, all the results need to be aggregated in the mixer.
3. The mixer receives the results from the shards. It then needs to merge
the aggregates. That is, if shard 1 finds 1,500 words in Hamlet and
shard 2 finds 1,000 words in Hamlet , the mixer needs to compute the
running total. After all the shards have returned, the mixer takes the top
five values (it needs to sort all the values this time—it can't use the
priority queue trick you saw in the last query, for the same reason that
the shards had to return all of their data) and returns them to the caller.
Search WWH ::




Custom Search