values of the lookup table are the same as the keys (because word is the only field needed by the query).
After the lookup table has been created, the shard iterates through each row of its portion of the Wikipedia table. For each row, it takes the title field and looks for it in the lookup table. If it does not exist, it skips the row. If it does exist, it adds a row to the results. The only field needed in the results in this example is the title, but in other cases there could be fields from the RIGHT table or other computed fields that would go in the results.
6. All of the work to compute the JOIN can be done in the shards, so all the mixer has to do is collect the results and return them to the caller.
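The steps above can be sketched in Python. This is a toy model of the shard-side join, not Dremel's actual implementation; the field names (word, title) follow the example in the text, and the function names are invented for illustration:

```python
def build_lookup(small_table_rows):
    """Build the lookup table from the small (RIGHT) table.

    The values are the same as the keys here, because the join key
    is the only field the query needs from the small table.
    """
    return {row["word"] for row in small_table_rows}

def shard_join(shard_rows, lookup):
    """Each shard scans its own portion of the large table independently."""
    results = []
    for row in shard_rows:
        # Skip rows whose title has no match in the lookup table.
        if row["title"] in lookup:
            # Only the title is needed in this example; other queries
            # might add fields from the RIGHT table or computed fields.
            results.append({"title": row["title"]})
    return results

def mixer(per_shard_results):
    """The mixer just concatenates what the shards produced."""
    merged = []
    for partial in per_shard_results:
        merged.extend(partial)
    return merged
```

Because each shard only needs the (small) lookup table plus its own slice of the large table, no cross-shard coordination is required, which is why the mixer's job reduces to concatenation.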
Shuffled Queries
As previously described, both the GROUP BY and the JOIN operations have some limitations: GROUP BY cannot handle cases in which there are large numbers of distinct values, and JOIN can't handle cases in which one of the tables is larger than 8 MB. The good news is that there is a mechanism that works around both of these problems; the bad news is that this mechanism requires a slight syntax change.
In both the GROUP BY case and the JOIN case, you need to aggregate data in the mixer because the shards don't have enough information to compute the results. For example, if you were grouping by corpus, multiple shards might see rows for the corpus Hamlet, so the partial result has to be passed back to the mixer. What if, however, the underlying data were sorted by corpus, and a single shard processed all the Hamlet entries? If this were the case, you could perform the GROUP BY operation in the shards, and it wouldn't matter how many distinct corpus values you had because you wouldn't have a bottleneck at the mixer.
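This idea can be sketched in Python. The sketch is a toy model under assumed names (corpus field, shard count), not Dremel code; it shows that once every row with a given corpus lands on the same shard, each shard can compute final rather than partial counts, so the mixer never re-aggregates:

```python
from collections import Counter, defaultdict

def partition_by_key(rows, num_shards):
    """Hash-partition rows so all rows with the same corpus value
    land on the same shard (no global sort is needed for this)."""
    shards = defaultdict(list)
    for row in rows:
        shards[hash(row["corpus"]) % num_shards].append(row)
    return shards

def shard_group_by(shard_rows):
    """After partitioning, each shard owns every row for its corpus
    values, so its counts are final, not partial."""
    return Counter(row["corpus"] for row in shard_rows)

rows = [{"corpus": c} for c in ["hamlet", "othello", "hamlet", "lear"]]
per_shard = [shard_group_by(chunk)
             for chunk in partition_by_key(rows, 4).values()]
# The mixer only concatenates the shard outputs; because no corpus
# appears on two shards, a plain merge produces the correct answer.
final = {corpus: n for counts in per_shard for corpus, n in counts.items()}
```

Note that the result is correct no matter how the hash function assigns corpus values to shards; what matters is only that equal keys are never split across shards.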
When you use GROUP EACH BY instead of GROUP BY, this tells Dremel to perform a shuffle operation. If you've used Hadoop or another MapReduce system, the shuffle step may look familiar; the hidden step in a MapReduce is the shuffle, which sorts all the data before passing it to the reducer. Dremel's shuffle is a little bit different from Hadoop's shuffle; the latter performs a merge sort of all the keys in the dataset. Dremel doesn't care about the ordering, however; it performs a hash partitioning of the data. The