Database Reference
In-Depth Information
same shard. This enables the
JOIN
operation to proceed in the shards and
in parallel, in a method similar to the broadcast
JOIN
.
Now that we've discussed the basic operation, we can consider an example,
using the same query as in the broadcast
JOIN
example, but with
JOIN
EACH
instead of
JOIN
. Note that this query doesn't
need
to use a shuffled
JOIN
, but it is useful to do so to compare its execution with the broadcast
JOIN
version. Here is the query:
SELECT wiki.title
FROM [publicdata:samples.wikipedia] as wiki
JOIN EACH(
SELECT word
FROM [publicdata:samples.shakespeare]
GROUP EACH BY word) as shakes
ON wiki.title = shakes.word
The steps involved to execute this query are:
1. The mixer receives the query and plans the execution. This query
requires five phases:
a. The Shakespeare table on the right must be shuffled prior to the
GROUP BY
.
b. The
GROUP BY
subquery on the right must be run to compute the
distinct Shakespeare words.
c. The right table must be shuffled prior to the
JOIN
.
d. The left table must be shuffled prior to the
JOIN
.
e. Finally, the
JOIN
results can be computed in the shards.
To start things off, the mixer dispatches the first part of the query (the
shuffle of the Shakespeare table) to the shards.
2. The shards receive the request to shuffle the
publicdata:samples.shakespeare
table into 100 partitions based
on the
word
field. Each shard computes a hash key for each
word
value
and sends the hashed row to a different shard indicated by the hash. For
the word “peace” for example, the hash function returns the value
7146922576. This would get sent to shard 76, because there are 100