are needed to enable enhanced decision making, insight, knowledge discovery, and process optimization. To handle Big Data, researchers proposed the use of a high-level programming model, called MapReduce, to process high volumes of data by using parallel and distributed computing [54] on large clusters or grids of nodes (i.e., commodity machines), which consist of a master node and multiple worker nodes. As implied by its name, MapReduce involves two key functions: "map" and "reduce". An advantage of using the MapReduce model is that users only need to focus on (and specify) these "map" and "reduce" functions, without worrying about implementation details for (i) partitioning the input data, (ii) scheduling and executing the program across multiple machines, (iii) handling machine failures, or (iv) managing inter-machine communication.
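To make this division of labor concrete, the following is a minimal single-machine sketch of the MapReduce contract in Python; the function run_mapreduce and the word-count example are illustrative assumptions, not part of any particular framework:

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal single-machine sketch of the MapReduce contract.

    A real cluster would partition `records` across worker nodes and handle
    scheduling, failures, and inter-machine communication; the user-visible
    part is still just the two functions passed in.
    """
    # "Map" phase: each input record yields intermediate (key, value) pairs.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    # The grouping above models the shuffle/sort; "reduce" phase: each key's
    # list of values is reduced independently (and thus in parallel).
    return [reduce_fn(key, values) for key, values in grouped.items()]

# Example: classic word count, expressed purely via "map" and "reduce".
counts = run_mapreduce(
    ["a rose is a rose"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: (word, sum(ones)),
)
print(counts)  # [('a', 2), ('rose', 2), ('is', 1)]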
To mine frequent patterns from Big probabilistic datasets of uncertain data, Leung and Hayduk [37] proposed the MR-growth algorithm. The algorithm uses MapReduce, applying two sets of the "map" and "reduce" functions, in a pattern-growth environment. Specifically, the master node reads and divides a probabilistic dataset D of uncertain data into partitions, and then assigns them to different worker nodes. The worker node corresponding to each partition P_j (where D = ⋃_j P_j) then outputs a pair consisting of an item x and its existential probability P(x, t_i), i.e., ⟨x, P(x, t_i)⟩, for every item x in transaction t_i assigned to P_j as intermediate results:

map: ⟨ID of t_i in P_j, contents of t_i⟩ ↦ list of ⟨x ∈ t_i, P(x, t_i)⟩.   (14.6)
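As an illustration of Eq. (14.6), this first "map" function can be sketched in Python, assuming each transaction t_i is stored as a list of (item, existential probability) pairs; the representation and the name map_phase1 are hypothetical, not from [37]:

def map_phase1(transaction_id, transaction):
    """Eq. (14.6): emit <x, P(x, t_i)> for every item x in transaction t_i.

    `transaction` is assumed to be a list of (item, existential probability)
    pairs, e.g. [("a", 0.9), ("b", 0.5)].
    """
    return [(item, prob) for item, prob in transaction]

# e.g., map_phase1("t1", [("a", 0.9), ("b", 0.5)]) yields the intermediate
# <x, P(x, t_i)> pairs [("a", 0.9), ("b", 0.5)].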
Afterwards, these ⟨x, P(x, t_i)⟩ pairs in the list (i.e., intermediate results) are shuffled and sorted (e.g., grouped by x). Each worker node then executes the "reduce" function, which (i) "reduces", by summing, all the P(x, t_i) values for each item x so as to compute its expected support expSup({x}, D) and (ii) outputs ⟨{x}, expSup({x}, D)⟩ (representing a frequent 1-itemset {x} and its expected support) if expSup({x}, D) ≥ minsup:

reduce: ⟨x, list of P(x, t_i)⟩ ↦ list of ⟨frequent 1-itemset {x}, expSup({x}, D)⟩,   (14.7)

where expSup({x}, D) = sum of P(x, t_i) in the list for an item x.
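The corresponding "reduce" step of Eq. (14.7) can be sketched in the same hypothetical Python setting (the name reduce_phase1 and the convention of returning None for infrequent items are our assumptions, not from [37]):

def reduce_phase1(item, probabilities, minsup):
    """Eq. (14.7): sum the shuffled P(x, t_i) values for item x to obtain
    expSup({x}, D), and emit the 1-itemset {x} only if it is frequent."""
    exp_sup = sum(probabilities)        # expSup({x}, D)
    if exp_sup >= minsup:
        return (frozenset([item]), exp_sup)
    return None                         # {x} is infrequent: emit nothing

# e.g., reduce_phase1("a", [0.5, 0.75, 1.0], minsup=2.0) returns
# (frozenset({'a'}), 2.25), whereas minsup = 2.5 would return None.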
Afterwards, MR-growth rereads the datasets to form an {x}-projected database (i.e., a collection of transactions containing x) for each item x in the list produced by the first "reduce" function (i.e., for each frequent 1-itemset {x}). The worker node corresponding to each projected database then (i) builds appropriate local UF-trees (based on the projected database assigned to the node) to mine frequent k-itemsets containing {x} (for k ≥ 2) and (ii) outputs ⟨X, expSup(X, D)⟩ (which represents a frequent k-itemset X and its expected support) if expSup(X, D) ≥ minsup. In other words, MR-growth executes the second set of "map" and "reduce" functions as follows:

map: ⟨ID of t_i in P_j, contents of t_i⟩ ↦ list of ⟨{x}, {x}-proj. DB⟩;   (14.8)
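In the same hypothetical Python setting, the second "map" function of Eq. (14.8) can be sketched as follows; grouping the emitted pairs into the complete {x}-projected database is left to the shuffle, and the local UF-tree mining that follows it is not shown:

def map_phase2(transaction_id, transaction, frequent_items):
    """Eq. (14.8): route transaction t_i into the {x}-projected database of
    every frequent item x that t_i contains.

    `frequent_items` is the set of items kept by the first "reduce"; the
    shuffle then groups the emitted pairs so that the worker node handling
    {x} receives all transactions containing x (i.e., the {x}-proj. DB).
    """
    return [(item, transaction)
            for item, _prob in transaction
            if item in frequent_items]

# e.g., map_phase2("t1", [("a", 0.9), ("c", 0.4), ("b", 0.5)], {"a", "b"})
# emits ("a", t_1's contents) and ("b", t_1's contents), but nothing for
# the infrequent item "c".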