MapReduce and the New Software Stack - Mining of Massive Datasets

Database Reference

In-Depth Information

! EXERCISE 2.5.2 Suppose relations R , S , and T have sizes r , s , and t , respectively, and we

want to take the 3-way join R ( A, B ) ⋈ S ( B, C ) ⋈ T ( A, C ), using k reducers. We shall hash

values of attributes A , B , and C to a , b , and c buckets, respectively, where abc = k . Each re-

ducer is associated with a vector of buckets, one for each of the three hash functions. Find,

as a function of r , s , t , and k , the values of a , b , and c that minimize the communication cost

of the algorithm.

Star Joins

A common structure for data mining of commercial data is the star join . For example, a chain store like Walmart keeps

a fact table whose tuples each represent a single sale. This relation looks like F ( A 1 , A 2 , . . . ), where each attribute A i

is a key representing one of the important components of the sale, such as the purchaser, the item purchased, the store

branch, or the date. For each key attribute there is a dimension table giving information about the participant. For in-

stance, the dimension table D ( A 1 , B 11 , B 12 , . . . ) might represent purchasers. A 1 is the purchaser ID, the key for this

relation. The B 1 i might give the purchaser's name, address, phone, and so on. Typically, the fact table is much larger

than the dimension tables. For instance, there might be a fact table of a billion tuples and ten dimension tables of a

million tuples each.

Analysts mine this data by asking analytic queries that typically join the fact table with several of the dimension

tables (a “star join”) and then aggregate the result into a useful form. For instance, an analyst might ask “give me a

table of sales of pants, broken down by region and color, for each month of 2012.” Under the communication-cost

model of this section, joining the fact table and dimension tables by a multiway join is almost certain to be more effi-

cient than joining the relations in pairs. In fact, it may make sense to store the fact table over however many compute

nodes are available, and replicate the dimension tables permanently in exactly the same way as we would replicate

them should we take the join of the fact table and all the dimension tables. In this special case, only the key attributes

(the A 's above) are hashed to buckets, and the number of buckets for each key attribute is proportional to the size of

its dimension table.

! EXERCISE 2.5.3 Suppose we take a star join of a fact table F ( A 1 , A 2 , . . . , A m ) with dimen-

sion tables D i ( A i , B i ) for i = 1 , 2 , . . . , m . Let there be k reducers, each associated with a

vector of buckets, one for each of the key attributes A 1 , A 2 , . . . , A m . Suppose the number

of buckets into which we hash A i is a i . Naturally, a 1 a 2 · · · a m = k . Finally, suppose each

dimension table D i has size d i , and the size of the fact table is much larger than any of these

sizes. Find the values of the a i that minimize the cost of taking the star join as one MapRe-

duce operation.

2.6 Complexity Theory for MapReduce

Now, we shall explore the design of MapReduce algorithms in more detail. Section 2.5 in-

troduced the idea that communication between the Map and Reduce tasks often accounts

for the largest fraction of the time spent by these tasks. Here, we shall look at how the com-

munication cost relates to other desiderata for MapReduce algorithms, in particular our de-

Search WWH ::

Custom Search

Home