Large-Scale File Systems and Map-Reduce - Mining of Massive Datasets

Databases Reference

In-Depth Information

Star Joins

A common structure for data mining of commercial data is the star join.

For example, a chain store like Walmart keeps a fact table whose tu-

ples each represent a single sale. This relation looks like F (A 1 , A 2 , . . . ),

where each attribute A i is a key representing one of the important com-

ponents of the sale, such as the purchaser, the item purchased, the store

branch, or the date. For each key attribute there is a dimension table

giving information about the participant. For instance, the dimension ta-

ble D(A 1 , B 11 , B 12 , . . . ) might represent purchasers. A 1 is the purchaser

ID, the key for this relation. The B 1i 's might give the purchaser's name,

address, phone, and so on. Typically, the fact table is much larger than

the dimension tables. For instance, there might be a fact table of a billion

tuples and ten dimension tables of a million tuples each.

Analysts mine this data by asking analytic queries that typically join

the fact table with several of the dimension tables (a “star join”) and then

aggregate the result into a useful form. For instance, an analyst might ask

“give me a table of sales of pants, broken down by region and color, for

each month of 2010.” Under the communication-cost model of this section,

joining the fact table and dimension tables by a multiway join is almost

certain to be more e cient than joining the relations in pairs. In fact, it

may make sense to store the fact table over however many compute nodes

are available, and replicate the dimension tables permanently in exactly

the same way as we would replicate them should we take the join of the

fact table and all the dimension tables. In this special case, only the

key attributes (the A's above) are hashed to buckets, and the number of

buckets for each key attribute is inversely proportional to the size of its

dimension table.

(b) The union algorithm of Section 2.3.6.

(c) The aggregation algorithm of Section 2.3.9.

(d) The matrix-multiplication algorithm of Section 2.3.11.

! Exercise 2.5.2 : Suppose relations R, S, and T have sizes r, s, and t, respec-

tively, and we want to take the 3-way join R(A, B) ⊲⊳ S(B, C) ⊲⊳ T (A, C), using

k Reduce tasks. We shall hash values of attributes A, B, and C to a, b, and

c buckets, respectively, where abc = k. Each Reduce task is associated with a

vector of buckets, one for each of the three hash functions. Find, as a function

of r, s, t, and k, the values of a, b, and c that minimize the communication cost

of the algorithm.

! Exercise 2.5.3 : Suppose we take a star join of a fact table F (A 1 , A 2 , . . . , A m )

with dimension tables D i (A i , B i ) for i = 1, 2, . . . , m.

Let there be k Reduce

Search WWH ::

Custom Search

Home