MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

the output of the join is large, then there is probably some aggregation being done to reduce

the size of the output. It will be necessary to communicate the result of the join to another

collection of tasks that perform this aggregation, and thus the communication cost will be

at least proportional to the computation needed to produce the output of the join.

□

2.5.2

Wall-Clock Time

While communication cost often influences our choice of algorithm to use in a cluster-com-

puting environment, we must also be aware of the importance of wall-clock time , the time

it takes a parallel algorithm to finish. Using careless reasoning, one could minimize total

communication cost by assigning all the work to one task, and thereby minimize total com-

munication. However, the wall-clock time of such an algorithm would be quite high. The

algorithms we suggest, or have suggested so far, have the property that the work is divided

fairly among the tasks. Therefore, the wall-clock time would be approximately as small as

it could be, given the number of compute nodes available.

2.5.3

Multiway Joins

To see how analyzing the communication cost can help us choose an algorithm in the

cluster-computing environment, we shall examine carefully the case of a multiway join.

There is a general theory in which we:

(1) Select certain attributes of the relations involved in the natural join of three or more

relations to have their values hashed, each to some number of buckets.

(2) Select the number of buckets for each of these attributes, subject to the constraint that

the product of the numbers of buckets for each attribute is k , the number of reducers

that will be used.

(3) Identify each of the k reducers with a vector of bucket numbers. These vectors have

one component for each of the attributes selected at step (1).

(4) Send tuples of each relation to all those reducers where it might find tuples to join

with. That is, the given tuple t will have values for some of the attributes selected at

step (1), so we can apply the hash function(s) to those values to determine certain com-

ponents of the vector that identifies the reducers. Other components of the vector are

unknown, so t must be sent to reducers for all vectors having any value in these un-

known components.

Some examples of this general technique appear in the exercises.

Search WWH ::

Custom Search

Home