values of the lookup table are the same as the keys (because word is the
only field needed by the query).
After the lookup table has been created, the shard iterates through each
row of its portion of the Wikipedia table. For each row, it takes the
title field and looks for it in the lookup table. If it does not exist, it
skips the row. If it does exist, it adds a row to the results. The only field
needed in the results in this example is the title, but in other cases
fields from the RIGHT table or other computed fields could be included
in the results as well.
6. All of the work to compute the JOIN can be done in the shards, so all
the mixer has to do is collect the results and return them to the caller.
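The shard-side join described above can be sketched in a few lines of Python. This is an illustrative model, not Dremel's actual API: the function names and the toy tables are made up, but the mechanics match the text, where each shard builds the small table into an in-memory lookup set and then probes it with its own rows.

```python
def build_lookup(small_table_rows, key_field):
    """Build the in-memory lookup table from the small (RIGHT) table.

    Only the key values are stored, since in this example the key is the
    only field the query needs.
    """
    return {row[key_field] for row in small_table_rows}

def probe_shard(shard_rows, lookup, key_field, result_fields):
    """Scan one shard's rows, keeping only those whose key is in the lookup."""
    results = []
    for row in shard_rows:
        if row[key_field] in lookup:
            # A matching row contributes the requested fields to the results.
            results.append({f: row[f] for f in result_fields})
    return results

# Toy data: one shard of the Wikipedia table joined against a word list.
words = [{"word": "hamlet"}, {"word": "macbeth"}]
shard = [{"title": "hamlet"}, {"title": "tempest"}, {"title": "macbeth"}]

lookup = build_lookup(words, "word")
matches = probe_shard(shard, lookup, "title", ["title"])
# → [{'title': 'hamlet'}, {'title': 'macbeth'}]
```

Because every shard holds the complete lookup table, no shard needs to see any other shard's rows, which is why the mixer's only remaining job is to concatenate the per-shard results.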
Shuffled Queries
As previously described, both the GROUP BY and the JOIN operations have
some limitations; GROUP BY cannot handle cases in which there are large
numbers of distinct values, and JOIN can't handle cases in which one of the
tables is larger than 8 MB. The good news is that there is a mechanism that
works around both of these problems; the bad news is that this mechanism
requires a slight syntax change.
In both the GROUP BY case and the JOIN case, you need to aggregate data in
the mixer because the shards don't have enough information to compute the
results. For example, if you were grouping by corpus, multiple shards might
see rows for the corpus Hamlet, so the partial result has to be passed back
to the mixer. What if, however, the underlying data were sorted by corpus,
and a single shard processed all the Hamlet entries? If that were the case,
you could perform the GROUP BY operation in the shards, and it wouldn't
matter how many distinct corpus values you had because you wouldn't have
a bottleneck at the mixer.
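The idea can be sketched as a hash-partitioned shuffle followed by a shard-local GROUP BY. Everything here is a simplified model under assumed names (the shard count, `shuffle`, and `local_group_count` are illustrative, not Dremel internals); the point is that once every row with a given key lands on the same shard, each shard can aggregate independently and no key is ever split across two partial results.

```python
from collections import defaultdict

NUM_SHARDS = 4  # illustrative; the real system picks its own partitioning

def shuffle(rows, key_field):
    """Route each row to a shard by hashing its grouping key."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row[key_field]) % NUM_SHARDS].append(row)
    return shards

def local_group_count(shard_rows, key_field):
    """Shard-local GROUP BY ... COUNT(*): safe because no key spans shards."""
    counts = defaultdict(int)
    for row in shard_rows:
        counts[row[key_field]] += 1
    return dict(counts)

rows = [{"corpus": c} for c in ["hamlet", "hamlet", "othello", "lear"]]
partials = [local_group_count(s, "corpus")
            for s in shuffle(rows, "corpus").values()]

# The mixer merely concatenates the per-shard results; since the key
# spaces are disjoint, no count needs to be re-aggregated.
merged = {k: v for part in partials for k, v in part.items()}
```

Because the partial results have disjoint keys, the mixer never has to hold the full set of distinct values in memory, which is exactly the bottleneck the shuffle removes.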
Using GROUP EACH BY instead of GROUP BY tells Dremel to
perform a shuffle operation. If you've used Hadoop or another MapReduce
system, the shuffle step may look familiar; the hidden step in a MapReduce
is shuffle, which sorts all the data before passing it to the reducer.
Dremel's shuffle is a little bit different from Hadoop's shuffle; the latter
performs a merge sort of all the keys in the dataset. Dremel doesn't care
about the ordering, however; it performs a hash partitioning of the data. The