Database Reference
In-Depth Information
! EXERCISE 2.5.2 Suppose relations R , S , and T have sizes r , s , and t , respectively, and we
want to take the 3-way join R ( A, B ) S ( B, C ) T ( A, C ), using k reducers. We shall hash
values of attributes A , B , and C to a , b , and c buckets, respectively, where abc = k . Each re-
ducer is associated with a vector of buckets, one for each of the three hash functions. Find,
as a function of r , s , t , and k , the values of a , b , and c that minimize the communication cost
of the algorithm.
Star Joins
A common structure for data mining of commercial data is the star join . For example, a chain store like Walmart keeps
a fact table whose tuples each represent a single sale. This relation looks like F ( A 1 , A 2 , . . . ), where each attribute A i
is a key representing one of the important components of the sale, such as the purchaser, the item purchased, the store
branch, or the date. For each key attribute there is a dimension table giving information about the participant. For in-
stance, the dimension table D ( A 1 , B 11 , B 12 , . . . ) might represent purchasers. A 1 is the purchaser ID, the key for this
relation. The B 1 i might give the purchaser's name, address, phone, and so on. Typically, the fact table is much larger
than the dimension tables. For instance, there might be a fact table of a billion tuples and ten dimension tables of a
million tuples each.
Analysts mine this data by asking analytic queries that typically join the fact table with several of the dimension
tables (a “star join”) and then aggregate the result into a useful form. For instance, an analyst might ask “give me a
table of sales of pants, broken down by region and color, for each month of 2012.” Under the communication-cost
model of this section, joining the fact table and dimension tables by a multiway join is almost certain to be more effi-
cient than joining the relations in pairs. In fact, it may make sense to store the fact table over however many compute
nodes are available, and replicate the dimension tables permanently in exactly the same way as we would replicate
them should we take the join of the fact table and all the dimension tables. In this special case, only the key attributes
(the A 's above) are hashed to buckets, and the number of buckets for each key attribute is proportional to the size of
its dimension table.
! EXERCISE 2.5.3 Suppose we take a star join of a fact table F ( A 1 , A 2 , . . . , A m ) with dimen-
sion tables D i ( A i , B i ) for i = 1 , 2 , . . . , m . Let there be k reducers, each associated with a
vector of buckets, one for each of the key attributes A 1 , A 2 , . . . , A m . Suppose the number
of buckets into which we hash A i is a i . Naturally, a 1 a 2 · · · a m = k . Finally, suppose each
dimension table D i has size d i , and the size of the fact table is much larger than any of these
sizes. Find the values of the a i that minimize the cost of taking the star join as one MapRe-
duce operation.
2.6 Complexity Theory for MapReduce
Now, we shall explore the design of MapReduce algorithms in more detail. Section 2.5 in-
troduced the idea that communication between the Map and Reduce tasks often accounts
for the largest fraction of the time spent by these tasks. Here, we shall look at how the com-
munication cost relates to other desiderata for MapReduce algorithms, in particular our de-
Search WWH ::




Custom Search