Databases Reference
In-Depth Information
Star Joins
A common structure for data mining of commercial data is the star join.
For example, a chain store like Walmart keeps a fact table whose tu-
ples each represent a single sale. This relation looks like F (A 1 , A 2 , . . . ),
where each attribute A i is a key representing one of the important com-
ponents of the sale, such as the purchaser, the item purchased, the store
branch, or the date. For each key attribute there is a dimension table
giving information about the participant. For instance, the dimension ta-
ble D(A 1 , B 11 , B 12 , . . . ) might represent purchasers. A 1 is the purchaser
ID, the key for this relation. The B 1i 's might give the purchaser's name,
address, phone, and so on. Typically, the fact table is much larger than
the dimension tables. For instance, there might be a fact table of a billion
tuples and ten dimension tables of a million tuples each.
Analysts mine this data by asking analytic queries that typically join
the fact table with several of the dimension tables (a “star join”) and then
aggregate the result into a useful form. For instance, an analyst might ask
“give me a table of sales of pants, broken down by region and color, for
each month of 2010.” Under the communication-cost model of this section,
joining the fact table and dimension tables by a multiway join is almost
certain to be more e cient than joining the relations in pairs. In fact, it
may make sense to store the fact table over however many compute nodes
are available, and replicate the dimension tables permanently in exactly
the same way as we would replicate them should we take the join of the
fact table and all the dimension tables. In this special case, only the
key attributes (the A's above) are hashed to buckets, and the number of
buckets for each key attribute is inversely proportional to the size of its
dimension table.
(b) The union algorithm of Section 2.3.6.
(c) The aggregation algorithm of Section 2.3.9.
(d) The matrix-multiplication algorithm of Section 2.3.11.
! Exercise 2.5.2 : Suppose relations R, S, and T have sizes r, s, and t, respec-
tively, and we want to take the 3-way join R(A, B) ⊲⊳ S(B, C) ⊲⊳ T (A, C), using
k Reduce tasks. We shall hash values of attributes A, B, and C to a, b, and
c buckets, respectively, where abc = k. Each Reduce task is associated with a
vector of buckets, one for each of the three hash functions. Find, as a function
of r, s, t, and k, the values of a, b, and c that minimize the communication cost
of the algorithm.
! Exercise 2.5.3 : Suppose we take a star join of a fact table F (A 1 , A 2 , . . . , A m )
with dimension tables D i (A i , B i ) for i = 1, 2, . . . , m.
Let there be k Reduce
Search WWH ::




Custom Search