Database Reference
In-Depth Information
the output of the join is large, then there is probably some aggregation being done to reduce
the size of the output. It will be necessary to communicate the result of the join to another
collection of tasks that perform this aggregation, and thus the communication cost will be
at least proportional to the computation needed to produce the output of the join.
2.5.2
Wall-Clock Time
While communication cost often influences our choice of algorithm to use in a cluster-com-
puting environment, we must also be aware of the importance of wall-clock time , the time
it takes a parallel algorithm to finish. Using careless reasoning, one could minimize total
communication cost by assigning all the work to one task, and thereby minimize total com-
munication. However, the wall-clock time of such an algorithm would be quite high. The
algorithms we suggest, or have suggested so far, have the property that the work is divided
fairly among the tasks. Therefore, the wall-clock time would be approximately as small as
it could be, given the number of compute nodes available.
2.5.3
Multiway Joins
To see how analyzing the communication cost can help us choose an algorithm in the
cluster-computing environment, we shall examine carefully the case of a multiway join.
There is a general theory in which we:
(1) Select certain attributes of the relations involved in the natural join of three or more
relations to have their values hashed, each to some number of buckets.
(2) Select the number of buckets for each of these attributes, subject to the constraint that
the product of the numbers of buckets for each attribute is k , the number of reducers
that will be used.
(3) Identify each of the k reducers with a vector of bucket numbers. These vectors have
one component for each of the attributes selected at step (1).
(4) Send tuples of each relation to all those reducers where it might find tuples to join
with. That is, the given tuple t will have values for some of the attributes selected at
step (1), so we can apply the hash function(s) to those values to determine certain com-
ponents of the vector that identifies the reducers. Other components of the vector are
unknown, so t must be sent to reducers for all vectors having any value in these un-
known components.
Some examples of this general technique appear in the exercises.
Search WWH ::




Custom Search