Database Reference
In-Depth Information
column referenced in the query. As it scans the table, the shard walks
through the opened column files in parallel, one row at a time.
The first portion of the query to be applied is the WHERE clause, which
filters out rows that aren't interesting. In this case, all rows where the
word length is less than 4 are dropped, and the regular expression is
applied to remove words in all caps.
The shards can apply an optimization here—they need to return only
the top five values ordered by word count. To find the top five values,
they use a data structure called a priority queue , which is an efficient
way of keeping track of the top values without having to sort them all in
place.
3. Each shard returns its top 5 results to the mixer. The mixer can also use
a priority queue to keep track of the top five results from the shards,
and after all the shards have completed, it can return results back to the
caller. The mixer doesn't have to worry about the WHERE clause because
all the values it handles should be valid responses to the query.
If there was no ORDER BY operation, the mixer could perform another
optimization; it could return immediately after it had received five valid
results from the shards. But in this case it has to wait until all the shards
have finished because the top five values might be returned only in the
last shard.
GROUP BY Queries
You saw how a simple query that just does a filter and a sort operates. Now
look at a slightly more complex query that does aggregation:
SELECT corpus, SUM(word_count) AS total_words
FROM [publicdata:samples.shakespeare]
WHERE LENGTH(word) > 4
GROUP BY corpus
ORDER BY total_words DESC
LIMIT 5
This query computes the top five Shakespeare plays in terms of total word
count, excluding words shorter than five characters, and returns them in
reverse order. For example, the top result is Richard III , which has 11,514
words, followed by Hamlet at 11,439 words.
Search WWH ::




Custom Search