2.5.2 Elapsed Communication Cost
There is another measure of cost based on communication that is worth mentioning, although we shall not use it in the developments of this section. The
elapsed communication cost is the maximum, over all paths through the acyclic
network, of the sum of the communication costs of the tasks along that path.
For example, in a map-reduce job, the elapsed communication cost is the sum
of the maximum input size for any Map task, plus the maximum input size for
any Reduce task.
Elapsed communication cost corresponds to the minimum wall-clock time
for the execution of a parallel algorithm. By careless reasoning, one could minimize the total communication cost by assigning all the work to a single task, and thereby reduce total communication to a minimum. However, the elapsed time of such an
algorithm would be quite high. The algorithms we suggest here, and those we have suggested so far, divide the work fairly among the tasks, so the elapsed communication cost would be approximately as small as it could be.
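The two cost measures can be contrasted with a small sketch. For a map-reduce job, the total communication cost is the sum of the input sizes of all tasks, while the elapsed communication cost is the maximum Map input plus the maximum Reduce input, as described above. The task input sizes below are invented purely for illustration:

```python
# Hypothetical input sizes for each Map and each Reduce task (e.g., in MB).
map_inputs = [40, 55, 60]
reduce_inputs = [30, 90, 25]

# Total communication cost: the sum of the inputs of all tasks.
total_cost = sum(map_inputs) + sum(reduce_inputs)

# Elapsed communication cost: the maximum over all paths through the
# acyclic network; for a map-reduce job, the maximum Map input plus
# the maximum Reduce input.
elapsed_cost = max(map_inputs) + max(reduce_inputs)

print(total_cost)    # 300
print(elapsed_cost)  # 150
```

Notice that piling all the work onto one task would lower neither measure here: the elapsed cost would grow to the full input size, even though the total cost might shrink.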
2.5.3 Multiway Joins
To see how analyzing the communication cost can help us choose an algorithm
in the cluster-computing environment, we shall examine carefully the case of a
multiway join. There is a general theory in which we:
1. Select certain attributes of the relations involved in a natural join to have
their values hashed to some number of buckets.
2. Select the number of buckets for each of these attributes, subject to the
constraint that the product of the numbers of buckets for each attribute
is k, the number of Reduce tasks that will be used.
3. Identify each of the k Reduce tasks with a vector of bucket numbers, one
for each of the hashed attributes.
4. Send each tuple of each relation to all those Reduce tasks where it might
find tuples to join with. That is, a given tuple t will have values for
some of the hashed attributes, so we can apply the hash function(s) to
those values to determine certain components of the vector identifying
the Reduce tasks. Other components of the vector are unknown, so t
must be sent to all the Reduce tasks having any value in these unknown
components.
Some examples of this general technique appear in the exercises.
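The routing rule of step 4 can be sketched in code for the three-way join R(A, B) ⊲⊳ S(B, C) ⊲⊳ T (C, D), assuming we hash attribute B to b buckets and attribute C to c buckets, with bc = k. The function name and the choice of Python's built-in hash are illustrative, not part of the text:

```python
def reducers_for_tuple(relation, tup, b, c):
    """Return the (B-bucket, C-bucket) IDs of all Reduce tasks that
    must receive the tuple tup of the named relation.  There are
    b * c = k Reduce tasks, one per vector of bucket numbers."""
    if relation == "R":                 # R(A, B): B is known, C is unknown,
        a, bval = tup                   # so replicate across all C-buckets.
        return [(hash(bval) % b, j) for j in range(c)]
    if relation == "T":                 # T(C, D): C is known, B is unknown,
        cval, d = tup                   # so replicate across all B-buckets.
        return [(i, hash(cval) % c) for i in range(b)]
    # S(B, C): both hashed attributes are known, so exactly one Reduce task.
    bval, cval = tup
    return [(hash(bval) % b, hash(cval) % c)]
```

For example, with b = 3 and c = 4, an S-tuple goes to exactly one of the twelve Reduce tasks, an R-tuple is replicated to four of them, and a T-tuple to three; any R-, S-, and T-tuples that actually join meet at the single Reduce task determined by the S-tuple's hashed values.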
Here, we shall look only at the join R(A, B) ⊲⊳ S(B, C) ⊲⊳ T (C, D). Suppose
that the relations R, S, and T have sizes r, s, and t, respectively, and for
simplicity, suppose that the probability is p that
1. An R-tuple and an S-tuple agree on B, and also the probability that