assigned to it, each process needs access to all of the tid-lists belonging to items in its assigned classes. Because the equivalence classes of ECLAT can be processed independently, no communication is needed after the initial class distribution. In many cases, this vertical exploration of the itemset lattice, as opposed to the horizontal exploration of Apriori-based approaches, leads to much smaller memory requirements. However, ParEclat makes an implicit assumption that the tid-lists associated with the subset of classes being explored fit in the main memory of a single process, which may not always be the case. Consider the case in which some subset of items appears in a significant portion of T: the corresponding tid-lists for these items will then be a large fraction of the original list of transactions. Based on our original assumption that T does not fit in the memory of a single process, it can be concluded that ParEclat is memory scalable for only certain classes of input.
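To make the memory argument concrete, the following is a minimal sketch of ECLAT-style support counting over tid-lists, assuming a toy vertical database; the names tid_lists, support, and min_support are illustrative and not part of the ParEclat implementation. Note how the tid-lists of items that appear in most transactions are nearly as large as the transaction set T itself, which is exactly the situation that breaks the single-process memory assumption.

    # Toy vertical database: each item maps to the set of transaction ids (tids)
    # that contain it. An item that appears in most transactions has a tid-list
    # nearly as large as the transaction set T itself.
    tid_lists = {
        "a": {1, 2, 3, 4, 5, 6, 7, 8},
        "b": {1, 2, 3, 4, 5, 6, 7},
        "c": {2, 4, 6},
        "d": {1, 5},
    }

    min_support = 2

    def support(itemset):
        """Support of an itemset is the size of the intersection of its tid-lists."""
        return len(set.intersection(*(tid_lists[i] for i in itemset)))

    # Within one equivalence class (here: all 2-itemsets with prefix "a"), support
    # counting needs only the tid-lists of the items involved, so the class can be
    # mined with no communication after the initial class distribution.
    for item in ("b", "c", "d"):
        s = support(("a", item))
        if s >= min_support:
            print(f"{{'a', '{item}'}}: support {s}")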
Another method similar to ParEclat was introduced by Cong et al. [9]. Their method, called Par-FP, follows the same parallelization strategy as ParEclat, but uses standard FP-Growth as its underlying algorithm instead of ECLAT. Like ParEclat, Par-FP is also a pattern generation method and is thus subject to similar memory assumptions. Moens et al. [39] proposed Dist-Eclat and BigFIM in the MapReduce context. Dist-Eclat follows the same approach as ParEclat, adapted to the MapReduce framework. BigFIM is a hybrid approach that uses a MapReduce equivalent of CD to compute candidate itemsets level-wise until each process' assigned set of candidate itemsets no longer fits in its memory, at which point the algorithm transitions to a pattern growth approach.
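The following is a rough sketch of the kind of level-wise-then-switch control flow described for BigFIM, under simplifying assumptions: counting is done by brute-force subset enumeration rather than MapReduce jobs, and the switch test uses a crude candidate-count proxy for the memory budget. The names hybrid_mine, frequent_of_size, and candidate_budget are hypothetical and not taken from Moens et al.

    from itertools import combinations

    def frequent_of_size(transactions, k, min_support):
        """CD-style level: count every size-k itemset that occurs in the transactions."""
        counts = {}
        for t in transactions:
            for cand in combinations(sorted(t), k):
                counts[cand] = counts.get(cand, 0) + 1
        return {c for c, n in counts.items() if n >= min_support}

    def hybrid_mine(transactions, min_support, candidate_budget):
        """Mine level-wise until the estimated next candidate set exceeds the budget."""
        results, k = [], 1
        while True:
            frequent_k = frequent_of_size(transactions, k, min_support)
            if not frequent_k:
                return results, set()          # finished level-wise; no switch needed
            results.append(frequent_k)
            # Crude stand-in for "candidates no longer fit in memory": if the next
            # level could hold up to |F_k|^2 candidates, hand off to pattern growth.
            if len(frequent_k) ** 2 > candidate_budget:
                return results, frequent_k     # these prefixes would seed pattern growth
            k += 1

    transactions = [frozenset(t) for t in
                    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
    levels, prefixes = hybrid_mine(transactions, min_support=2, candidate_budget=4)
    print(levels)    # frequent itemsets found level-wise before the switch
    print(prefixes)  # frequent itemsets that would seed the pattern-growth phase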
Zaki et al. [66] proposed shared memory implementations of both the CD algorithm and the DD algorithm. In their implementation of the CD algorithm, which is referred to as Common Candidate Partitioned Database (CCPD), each process computes support for all candidates with respect to a distinct partition of the list of transactions. The support counts are aggregated in a shared candidate hash tree, whose access is controlled with a locking mechanism at each leaf node. The Partitioned Candidate Common Database (PCCD) algorithm is the shared memory implementation of the DD algorithm. Zaki et al. observed that, although no locking is required because processes operate on disjoint candidate hash trees, the increased disk contention from all processes scanning a shared list of transactions leads to a slowdown when using more than one process.
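A minimal sketch of the CCPD counting scheme follows, under simplifying assumptions: a flat bucketed dictionary stands in for the candidate hash tree, Python threads stand in for processes, and each thread locks only the bucket (the "leaf") that holds the candidate it is updating. The bucket count and candidate set are illustrative.

    import threading

    candidates = [frozenset(c) for c in [{"a", "b"}, {"a", "c"}, {"b", "c"}]]
    NUM_BUCKETS = 4
    buckets = [{} for _ in range(NUM_BUCKETS)]               # shared support counts
    locks = [threading.Lock() for _ in range(NUM_BUCKETS)]   # one lock per "leaf"

    def bucket_of(candidate):
        return hash(candidate) % NUM_BUCKETS

    def count_partition(partition):
        """Each thread scans its own partition of the transaction list."""
        for t in partition:
            for cand in candidates:
                if cand <= t:
                    b = bucket_of(cand)
                    with locks[b]:                           # lock only one leaf bucket
                        buckets[b][cand] = buckets[b].get(cand, 0) + 1

    transactions = [frozenset(t) for t in
                    [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b", "c"}, {"a", "b", "c"}]]
    partitions = [transactions[:3], transactions[3:]]        # one partition per process

    threads = [threading.Thread(target=count_partition, args=(p,)) for p in partitions]
    for th in threads:
        th.start()
    for th in threads:
        th.join()

    support = {c: n for b in buckets for c, n in b.items()}
    print(support)   # aggregated supports over the shared structure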
4.2 Work Partitioning
A limitation of the DD algorithm is that it leads to redundant work when compared with CD, which performs the same amount of work as its serial counterpart. Let the cost of incrementing the count for a candidate itemset that exists in a process' hash tree be a function f(X) of the number of unique candidate itemsets stored in the hash tree, X. Then, in the case of the CD algorithm, the cost becomes f(M_k), where M_k is the total number of candidate itemsets of a given size k. In contrast, for the DD algorithm, the cost is f(M_k/p). Furthermore, in the CD algorithm, each process