Understanding Query Execution - Google BigQuery Analytics

Database Reference

In-Depth Information

same shard. This enables the JOIN operation to proceed in the shards and

in parallel, in a method similar to the broadcast JOIN .

Now that we've discussed the basic operation, we can consider an example,

using the same query as in the broadcast JOIN example, but with JOIN

EACH instead of JOIN . Note that this query doesn't need to use a shuffled

JOIN , but it is useful to do so to compare its execution with the broadcast

JOIN version. Here is the query:

SELECT wiki.title

FROM [publicdata:samples.wikipedia] as wiki

JOIN EACH(

SELECT word

FROM [publicdata:samples.shakespeare]

GROUP EACH BY word) as shakes

ON wiki.title = shakes.word

The steps involved to execute this query are:

1. The mixer receives the query and plans the execution. This query

requires five phases:

a. The Shakespeare table on the right must be shuffled prior to the

GROUP BY .

b. The GROUP BY subquery on the right must be run to compute the

distinct Shakespeare words.

c. The right table must be shuffled prior to the JOIN .

d. The left table must be shuffled prior to the JOIN .

e. Finally, the JOIN results can be computed in the shards.

To start things off, the mixer dispatches the first part of the query (the

shuffle of the Shakespeare table) to the shards.

2. The shards receive the request to shuffle the

publicdata:samples.shakespeare table into 100 partitions based

on the word field. Each shard computes a hash key for each word value

and sends the hashed row to a different shard indicated by the hash. For

the word “peace” for example, the hash function returns the value

7146922576. This would get sent to shard 76, because there are 100

Search WWH ::

Custom Search

Home