assigned to it, each process needs access to all of the tid-lists belonging to items in its assigned classes. Because the equivalence classes of ECLAT can be processed independently, no communication is needed after the initial class distribution. In many cases, this vertical exploration of the itemset lattice, as opposed to the horizontal exploration of Apriori-based approaches, leads to much smaller memory requirements. However, ParEclat makes an implicit assumption that the tid-lists associated with the subset of classes being explored fit in the main memory of a single process, which may not always be the case. Consider the case in which some subset of items appears in a significant portion of T: the corresponding tid-lists for these items will then be a large fraction of the original list of transactions. Based on our original assumption that T does not fit in the memory of a single process, it can be concluded that ParEclat is memory scalable for only certain classes of input.
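To make the memory argument concrete, the following is a minimal sketch of ECLAT-style support counting over tid-lists, assuming a toy vertical database; the names tid_lists, support, and min_support are illustrative and not part of the ParEclat implementation. Note how the tid-lists of items that appear in most transactions are nearly as large as the transaction set T itself, which is exactly the situation that breaks the single-process memory assumption.

    # Toy vertical database: each item maps to the set of transaction ids (tids)
    # that contain it. An item that appears in most transactions has a tid-list
    # nearly as large as the transaction set T itself.
    tid_lists = {
        "a": {1, 2, 3, 4, 5, 6, 7, 8},
        "b": {1, 2, 3, 4, 5, 6, 7},
        "c": {2, 4, 6},
        "d": {1, 5},
    }

    min_support = 2

    def support(itemset):
        """Support of an itemset is the size of the intersection of its tid-lists."""
        return len(set.intersection(*(tid_lists[i] for i in itemset)))

    # Within one equivalence class (here: all 2-itemsets with prefix "a"), support
    # counting needs only the tid-lists of the items involved, so the class can be
    # mined with no communication after the initial class distribution.
    for item in ("b", "c", "d"):
        s = support(("a", item))
        if s >= min_support:
            print(f"{{'a', '{item}'}}: support {s}")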
Another method similar to ParEclat was introduced by Cong et al. [9]. Their method, called Par-FP, follows the same parallelization strategy as ParEclat, but uses standard FP-Growth as its underlying algorithm instead of ECLAT. Like ParEclat, Par-FP is also a pattern generation method and is thus subject to similar memory assumptions. Moens et al. [39] proposed Dist-Eclat and BigFIM in the MapReduce context. Dist-Eclat follows the same approach as ParEclat, adapted to the MapReduce framework. BigFIM is a hybrid approach that uses a MapReduce equivalent of CD to compute candidate itemsets level-wise until each process' assigned set of candidate itemsets no longer fits in its memory, at which point the algorithm transitions to a pattern growth approach.
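The following is a rough sketch of the kind of level-wise-then-switch control flow described for BigFIM, under simplifying assumptions: counting is done by brute-force subset enumeration rather than MapReduce jobs, and the switch test uses a crude candidate-count proxy for the memory budget. The names hybrid_mine, frequent_of_size, and candidate_budget are hypothetical and not taken from Moens et al.

    from itertools import combinations

    def frequent_of_size(transactions, k, min_support):
        """CD-style level: count every size-k itemset that occurs in the transactions."""
        counts = {}
        for t in transactions:
            for cand in combinations(sorted(t), k):
                counts[cand] = counts.get(cand, 0) + 1
        return {c for c, n in counts.items() if n >= min_support}

    def hybrid_mine(transactions, min_support, candidate_budget):
        """Mine level-wise until the estimated next candidate set exceeds the budget."""
        results, k = [], 1
        while True:
            frequent_k = frequent_of_size(transactions, k, min_support)
            if not frequent_k:
                return results, set()          # finished level-wise; no switch needed
            results.append(frequent_k)
            # Crude stand-in for "candidates no longer fit in memory": if the next
            # level could hold up to |F_k|^2 candidates, hand off to pattern growth.
            if len(frequent_k) ** 2 > candidate_budget:
                return results, frequent_k     # these prefixes would seed pattern growth
            k += 1

    transactions = [frozenset(t) for t in
                    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
    levels, prefixes = hybrid_mine(transactions, min_support=2, candidate_budget=4)
    print(levels)    # frequent itemsets found level-wise before the switch
    print(prefixes)  # frequent itemsets that would seed the pattern-growth phase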
Zaki et al. [66] proposed shared memory implementations of both the CD algorithm and the DD algorithm. In their implementation of the CD algorithm, which is referred to as Common Candidate Partitioned Database (CCPD), each process computes support for all candidates with respect to a distinct partition of the list of transactions. The support counts are aggregated in a shared candidate hash tree, whose access is controlled with a locking mechanism at each leaf node. The Partitioned Candidate Common Database (PCCD) algorithm is the shared memory implementation of the DD algorithm. Zaki et al. observed that, although no locking is required because processes operate on disjoint candidate hash trees, the increased disk contention from all processes scanning a shared list of transactions leads to a slowdown when using more than one process.
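A minimal sketch of the CCPD counting scheme follows, under simplifying assumptions: a flat bucketed dictionary stands in for the candidate hash tree, Python threads stand in for processes, and each thread locks only the bucket (the "leaf") that holds the candidate it is updating. The bucket count and candidate set are illustrative.

    import threading

    candidates = [frozenset(c) for c in [{"a", "b"}, {"a", "c"}, {"b", "c"}]]
    NUM_BUCKETS = 4
    buckets = [{} for _ in range(NUM_BUCKETS)]               # shared support counts
    locks = [threading.Lock() for _ in range(NUM_BUCKETS)]   # one lock per "leaf"

    def bucket_of(candidate):
        return hash(candidate) % NUM_BUCKETS

    def count_partition(partition):
        """Each thread scans its own partition of the transaction list."""
        for t in partition:
            for cand in candidates:
                if cand <= t:
                    b = bucket_of(cand)
                    with locks[b]:                           # lock only one leaf bucket
                        buckets[b][cand] = buckets[b].get(cand, 0) + 1

    transactions = [frozenset(t) for t in
                    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
    partitions = [transactions[:3], transactions[3:]]        # one partition per process

    threads = [threading.Thread(target=count_partition, args=(p,)) for p in partitions]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    support = {c: n for b in buckets for c, n in b.items()}
    print(support)   # aggregated supports over the shared structure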
4.2 Work Partitioning
A limitation of the DD algorithm is that it leads to redundant work when compared with CD, which performs the same amount of work as its serial counterpart. Let the cost of incrementing the count for a candidate itemset that exists in a process' hash tree be a function f(X) of the number of unique candidate itemsets stored in the hash tree, X. Then, in the case of the CD algorithm, the cost becomes f(M_k), where M_k is the total number of candidate itemsets of a given size k. In contrast, for the DD algorithm, the cost is f(M_k/p). Furthermore, in the CD algorithm, each process