Understanding Query Execution - Google BigQuery Analytics

Database Reference

In-Depth Information

goal of these two approaches is the same, but a hash partitioning is less work

than a sort.

The way the hash partitioning works is that each value is assigned a key.

A hash function is applied in a source shard to the key to turn it into a

number. That number, the hash code, is used to assign the entire row to a

destination shard. The source shard then sends the row over to the network

to the assigned destination shard. Because the workers are assigned stable

ranges of the hash key space, a single worker gets all the values that have the

same key.

For GROUP EACH BY operations, the shuffle key is the field or fields used

for the grouping. That is, if you do a GROUP EACH BY corpus on the

Shakespeare table, all the rows that have Hamlet as the corpus hash to the

same value and are sent to the same shard. All the rows that have the corpus

of Macbeth will hash to a different value, so they will be sent to a different

shard than the Hamlet rows. Figure 9.4 shows how the shuffle operation

works. Each shard first reads the data from storage, then forwards that data

on to a diferent shard depending on the value of the shuffle key.

Figure 9.4 Shuffle operation

This same trick also works for JOIN , but both sides of the JOIN need to be

shuffled. To specify a shuffled JOIN , you can use the syntax JOIN EACH

where you would otherwise have used JOIN , as in LEFT OUTER JOIN

EACH . The shuffle key is the field or fields used in the ON clause. That

is, if your query looks like JOIN EACH . . . ON left.key1 =

right.key2 , the left table is shuffled by the key1 field, and the right table

is shuffled by the key2 field. Since the same hash function is used on both,

all of the rows with matching values from either table will end up in the

Search WWH ::

Custom Search

Home