values of the lookup table are the same as the keys (because word is the only field needed by the query).
After the lookup table has been created, the shard iterates through each row of its portion of the Wikipedia table. For each row, it takes the title field and looks for it in the lookup table. If it does not exist, it skips the row. If it does exist, it adds a row to the results. The only field needed in the results in this example is the title, but in other cases there could be fields from the RIGHT table or other computed fields that would go in the results.
6. All of the work to compute the JOIN can be done in the shards, so all the mixer has to do is collect the results and return them to the caller.
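The steps above can be sketched in Python. This is a toy model of the shard-side join, not Dremel's actual implementation; the field names (word, title) follow the example in the text, and the function names are invented for illustration:

```python
def build_lookup(small_table_rows):
    """Build the lookup table from the small (RIGHT) table.

    The values are the same as the keys here, because the join key
    is the only field the query needs from the small table.
    """
    return {row["word"] for row in small_table_rows}

def shard_join(shard_rows, lookup):
    """Each shard scans its own portion of the large table independently."""
    results = []
    for row in shard_rows:
        # Skip rows whose title has no match in the lookup table.
        if row["title"] in lookup:
            # Only the title is needed in this example; other queries
            # might add fields from the RIGHT table or computed fields.
            results.append({"title": row["title"]})
    return results

def mixer(per_shard_results):
    """The mixer just concatenates what the shards produced."""
    merged = []
    for partial in per_shard_results:
        merged.extend(partial)
    return merged
```

Because each shard only needs the (small) lookup table plus its own slice of the large table, no cross-shard coordination is required, which is why the mixer's job reduces to concatenation.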
Shuffled Queries
As previously described, both the GROUP BY and the JOIN operations have some limitations: GROUP BY cannot handle cases in which there are large numbers of distinct values, and JOIN can't handle cases in which one of the tables is larger than 8 MB. The good news is that there is a mechanism that works around both of these problems; the bad news is that this mechanism requires a slight syntax change.
In both the GROUP BY case and the JOIN case, you need to aggregate data in the mixer because the shards don't have enough information to compute the results. For example, if you were grouping by corpus, multiple shards might see rows for the corpus Hamlet, so the partial result has to be passed back to the mixer. What if, however, the underlying data were sorted by corpus, and a single shard processed all the Hamlet entries? If this were the case, you could perform the GROUP BY operation in the shards, and it wouldn't matter how many distinct corpus values you had because you wouldn't have a bottleneck at the mixer.
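This idea can be sketched in Python. The sketch is a toy model under assumed names (corpus field, shard count), not Dremel code; it shows that once every row with a given corpus lands on the same shard, each shard can compute final rather than partial counts, so the mixer never re-aggregates:

```python
from collections import Counter, defaultdict

def partition_by_key(rows, num_shards):
    """Hash-partition rows so all rows with the same corpus value
    land on the same shard (no global sort is needed for this)."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row["corpus"]) % num_shards].append(row)
    return shards

def shard_group_by(shard_rows):
    """After partitioning, each shard owns every row for its corpus
    values, so its counts are final, not partial."""
    return Counter(row["corpus"] for row in shard_rows)

rows = [{"corpus": c} for c in ["hamlet", "othello", "hamlet", "lear"]]
per_shard = [shard_group_by(chunk)
             for chunk in partition_by_key(rows, 4).values()]
# The mixer only concatenates the shard outputs; because no corpus
# appears on two shards, a plain merge produces the correct answer.
final = {corpus: n for counts in per_shard for corpus, n in counts.items()}
```

Note that the result is correct no matter how the hash function assigns corpus values to shards; what matters is only that equal keys are never split across shards.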
When you use GROUP EACH BY instead of GROUP BY, this tells Dremel to perform a shuffle operation. If you've used Hadoop or another MapReduce system, the shuffle step may look familiar; the hidden step in a MapReduce is the shuffle, which sorts all the data before passing it to the reducer. Dremel's shuffle is a little bit different from Hadoop's shuffle; the latter performs a merge sort of all the keys in the dataset. Dremel doesn't care about the ordering, however; it performs a hash partitioning of the data. The