are needed to enable enhanced decision making, insight, knowledge discovery, and process optimization. To handle Big Data, researchers proposed the use of a high-level programming model, called MapReduce, to process high volumes of data by using parallel and distributed computing [54] on large clusters or grids of nodes (i.e., commodity machines), which consist of a master node and multiple worker nodes. As implied by its name, MapReduce involves two key functions: "map" and "reduce". An advantage of using the MapReduce model is that users only need to focus on (and specify) these "map" and "reduce" functions, without worrying about implementation details for (i) partitioning the input data, (ii) scheduling and executing the program across multiple machines, (iii) handling machine failures, or (iv) managing inter-machine communication.
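To make this division of labor concrete, the following is a minimal single-machine sketch of the MapReduce contract in Python; the function run_mapreduce and the word-count example are illustrative assumptions, not part of any particular framework:

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Minimal single-machine sketch of the MapReduce contract.

    A real cluster would partition `records` across worker nodes and handle
    scheduling, failures, and inter-machine communication; the user-visible
    part is still just the two functions passed in.
    """
    # "Map" phase: each input record yields intermediate (key, value) pairs.
    grouped = defaultdict(list)
    for record in records:
        for key, value in map_fn(record):
            grouped[key].append(value)
    # The grouping above models the shuffle/sort; "reduce" phase: each key's
    # list of values is reduced independently (and thus in parallel).
    return [reduce_fn(key, values) for key, values in grouped.items()]

# Example: classic word count, expressed purely via "map" and "reduce".
counts = run_mapreduce(
    ["a rose is a rose"],
    map_fn=lambda line: [(word, 1) for word in line.split()],
    reduce_fn=lambda word, ones: (word, sum(ones)),
)
print(counts)  # [('a', 2), ('rose', 2), ('is', 1)]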
To mine frequent patterns from Big probabilistic datasets of uncertain data, Leung and Hayduk [37] proposed the MR-growth algorithm. The algorithm uses MapReduce, applying two sets of the "map" and "reduce" functions, in a pattern-growth environment. Specifically, the master node reads and divides a probabilistic dataset D of uncertain data into partitions, and then assigns them to different worker nodes. The worker node corresponding to each partition P_j (where D = ⋃_j P_j) then outputs a pair consisting of an item x and its existential probability P(x, t_i), i.e., ⟨x, P(x, t_i)⟩, for every item x in transaction t_i assigned to P_j as intermediate results:

map: ⟨ID of t_i in P_j, contents of t_i⟩ ↦ list of ⟨x ∈ t_i, P(x, t_i)⟩.   (14.6)
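As an illustration of Eq. (14.6), this first "map" function can be sketched in Python, assuming each transaction t_i is stored as a list of (item, existential probability) pairs; the representation and the name map_phase1 are hypothetical, not from [37]:

def map_phase1(transaction_id, transaction):
    """Eq. (14.6): emit <x, P(x, t_i)> for every item x in transaction t_i.

    `transaction` is assumed to be a list of (item, existential probability)
    pairs, e.g. [("a", 0.9), ("b", 0.5)].
    """
    return [(item, prob) for item, prob in transaction]

# e.g., map_phase1("t1", [("a", 0.9), ("b", 0.5)]) yields the intermediate
# <x, P(x, t_i)> pairs [("a", 0.9), ("b", 0.5)].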
Afterwards, these ⟨x, P(x, t_i)⟩ pairs in the list (i.e., intermediate results) are shuffled and sorted (e.g., grouped by x). Each worker node then executes the "reduce" function, which (i) "reduces", by summing, all the P(x, t_i) values for each item x so as to compute its expected support expSup({x}, D) and (ii) outputs ⟨{x}, expSup({x}, D)⟩ (representing a frequent 1-itemset {x} and its expected support) if expSup({x}, D) ≥ minsup:

reduce: ⟨x, list of P(x, t_i)⟩ ↦ list of ⟨frequent 1-itemset {x}, expSup({x}, D)⟩,   (14.7)

where expSup({x}, D) = sum of P(x, t_i) in the list for an item x.
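The corresponding "reduce" step of Eq. (14.7) can be sketched in the same hypothetical Python setting (the name reduce_phase1 and the convention of returning None for infrequent items are our assumptions, not from [37]):

def reduce_phase1(item, probabilities, minsup):
    """Eq. (14.7): sum the shuffled P(x, t_i) values for item x to obtain
    expSup({x}, D), and emit the 1-itemset {x} only if it is frequent."""
    exp_sup = sum(probabilities)        # expSup({x}, D)
    if exp_sup >= minsup:
        return (frozenset([item]), exp_sup)
    return None                         # {x} is infrequent: emit nothing

# e.g., reduce_phase1("a", [0.5, 0.75, 1.0], minsup=2.0) returns
# (frozenset({'a'}), 2.25), whereas minsup = 2.5 would return None.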
Afterwards, MR-growth rereads the datasets to form an {x}-projected database (i.e., a collection of transactions containing x) for each item x in the list produced by the first "reduce" function (i.e., for each frequent 1-itemset {x}). The worker node corresponding to each projected database then (i) builds appropriate local UF-trees (based on the projected database assigned to the node) to mine frequent k-itemsets containing {x} (for k ≥ 2) and (ii) outputs ⟨X, expSup(X, D)⟩ (which represents a frequent k-itemset X and its expected support) if expSup(X, D) ≥ minsup. In other words, MR-growth executes the second set of "map" and "reduce" functions as follows:

map: ⟨ID of t_i in P_j, contents of t_i⟩ ↦ list of ⟨{x}, {x}-proj. DB⟩;   (14.8)
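In the same hypothetical Python setting, the second "map" function of Eq. (14.8) can be sketched as follows; grouping the emitted pairs into the complete {x}-projected database is left to the shuffle, and the local UF-tree mining that follows it is not shown:

def map_phase2(transaction_id, transaction, frequent_items):
    """Eq. (14.8): route transaction t_i into the {x}-projected database of
    every frequent item x that t_i contains.

    `frequent_items` is the set of items kept by the first "reduce"; the
    shuffle then groups the emitted pairs so that the worker node handling
    {x} receives all transactions containing x (i.e., the {x}-proj. DB).
    """
    return [(item, transaction)
            for item, _prob in transaction
            if item in frequent_items]

# e.g., map_phase2("t1", [("a", 0.9), ("c", 0.4), ("b", 0.5)], {"a", "b"})
# emits ("a", t_1's contents) and ("b", t_1's contents), but nothing for
# the infrequent item "c".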