Solving the FSM problem via lexicographic tree projection in a distributed
message-passing environment, Guralnik and Karypis proposed a Data Parallel For-
mulation (DPF) similar to the Count Distribution strategy described in Sect. 4. It
partitions the input equally among the processes and requires each node to keep
track of the lexicographic tree representing frequent candidates and the associated
count matrices. Each node projects its local set of sequences onto level (k − 1) of
the tree to determine local supports for candidate itemsets at the (k + 1)th level.
A reduction operation facilitates computing global support of candidate itemsets.
Those meeting the minimum support threshold σ are then broadcast to all processes
before continuing to the next level. For problems with large numbers of candidates,
where the global tree and its associated count matrices do not fit in the local memory
of a process, DPF has to partition the tree and perform multiple scans of the local
database at each iteration. This added I/O overhead makes this formulation imprac-
tical in the Big Data context. The authors also showed that DPF's parallel efficiency
will decrease as the number of processes increases, even when the amount of work in-
creases, due to the overhead of maintaining the global tree at each process. To address
these limitations, they proposed Static and Dynamic Task-Parallel Formulations
(STPF and DTPF), which partition both the input data and the lexicographic tree
during computation.
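The core of one DPF level can be sketched as follows. This is a minimal illustration
of the count-distribution pattern rather than the authors' implementation: it assumes
mpi4py and NumPy, a caller-supplied count_local_supports routine and candidate list,
and treats the support threshold as an absolute count.

from mpi4py import MPI
import numpy as np

def dpf_level(comm, local_sequences, candidates, count_local_supports, min_count):
    """One level of a count-distribution style round (illustrative sketch)."""
    # Each process scans only its own partition of the sequence database,
    # producing one local support count per candidate at this level.
    local_counts = np.array(
        [count_local_supports(local_sequences, c) for c in candidates],
        dtype=np.int64)

    # Reduction: the global support of each candidate is the sum of its
    # local counts across all processes.
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)

    # All processes see the same global counts, so each keeps an identical
    # frequent set and the replicated lexicographic tree stays in sync.
    return [c for c, s in zip(candidates, global_counts) if s >= min_count]

Because the tree and its count matrices are replicated on every process, the per-process
memory footprint grows with the number of candidates, which is exactly the limitation
that forces the multi-scan variant described above.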
Both task-parallel formulations first use DPF to expand the tree up to a level
k + 1, k > 0. Further expanding a (k + 1)-level node requires count data from its
parent node. Therefore, nodes at level k are partitioned among the processes for
further processing, along with their projected databases. A node's projected database
contains a subset of the sequence database D such that each item in the itemsets of
each transaction is still viable to extend the sequence represented by the node into
a possible frequent sequence. These viable items are called active items. The question
remains how to partition the k-level nodes among the processes. The authors pro-
posed two static approaches. The first uses a bin-packing algorithm based on relative
computation time estimates for expanding the nodes, computed as the sum of their
children's corresponding sequential patterns' support. The second approach aims to
minimize the overlap in the nodes' projected databases via repeated minimum-cut
partitioning of the bipartite graph formed by the set of nodes and the set of itemsets
at level (k − 1) in the tree. The authors found that the bipartite graph partitioning
approach substantially reduces the overlap in projected databases and leads to
smaller execution times than other static task-parallel formulations.
However, its workload prediction accuracy decreases as the number of processes
increases, leading to unbalanced loads. As a result, the authors introduced dynamic
load balancing in DTPF that monitors process workloads and reassigns work as
necessary.
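The first static approach can be illustrated with a greedy bin-packing sketch that
always hands the next most expensive level-k node to the least-loaded process. The
node objects, the children/support attributes, and the cost estimate below are
assumptions made for the example, not the authors' data structures.

import heapq

def estimate_cost(node):
    # Relative computation time estimate for expanding a level-k node:
    # the sum of the supports of its children's sequential patterns.
    return sum(child.support for child in node.children)

def bin_pack_nodes(level_k_nodes, num_processes):
    """Greedily assign level-k nodes (and their projected databases) to processes."""
    nodes = sorted(level_k_nodes, key=estimate_cost, reverse=True)
    # Min-heap of (current estimated load, process rank, assigned nodes).
    bins = [(0, rank, []) for rank in range(num_processes)]
    heapq.heapify(bins)
    for node in nodes:
        load, rank, assigned = heapq.heappop(bins)
        assigned.append(node)
        heapq.heappush(bins, (load + estimate_cost(node), rank, assigned))
    return {rank: assigned for _, rank, assigned in bins}

A heuristic of this kind balances estimated work but says nothing about how much of
the database each process must hold; reducing that replication is what the bipartite
graph partitioning approach targets.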
Cong et al. took a sampling approach to accomplish static task partitioning in Par-
ASP [9], which they named selective sampling. After gathering 1-length frequent
sequence counts, they separate a sample S_D of the k-length frequent prefixes of
sequences in D. The size of the prefix k is a function of the average length of
sequences in D. The authors then use one process to mine the sample via a pattern
growth algorithm, recording execution times for the found frequent sub-sequences.
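The sampling step itself is simple to sketch. The snippet below is an illustrative
simplification that treats each sequence as a flat list of items (ignoring itemset
structure) and times the mining of each sampled prefix's projected database; the
helper names and the per-prefix timing scheme are assumptions, not the Par-ASP code.

import time
from collections import Counter

def selective_sample(sequences, k, min_support):
    """Return the k-length prefixes that are frequent in the database."""
    prefix_counts = Counter(tuple(seq[:k]) for seq in sequences if len(seq) >= k)
    return [p for p, count in prefix_counts.items() if count >= min_support]

def time_sample_mining(prefixes, sequences, mine_projected, min_support):
    """Mine each sampled prefix on one process and record how long it took;
    the recorded times serve as estimates of each task's relative weight."""
    timings = {}
    for prefix in prefixes:
        # Projected database: suffixes of the sequences that start with the prefix.
        projected = [seq[len(prefix):] for seq in sequences
                     if tuple(seq[:len(prefix)]) == prefix]
        start = time.perf_counter()
        mine_projected(prefix, projected, min_support)  # assumed pattern-growth miner
        timings[prefix] = time.perf_counter() - start
    return timings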