Solving the FSM problem via lexicographic tree projection in a distributed
message-passing environment, Guralnik and Karypis proposed a Data Parallel For-
mulation (DPF) similar to the Count Distribution strategy described in Sect. 4. It
partitions the input equally among the processes and requires each node to keep
track of the lexicographic tree representing frequent candidates and the associated
count matrices. Each node projects its local set of sequences onto level (k − 1) of
the tree to determine local supports for candidate itemsets at the (k + 1)th level.
A reduction operation facilitates computing global support of candidate itemsets.
Those meeting the minimum support threshold σ are then broadcast to all processes
before continuing to the next level. For problems with large numbers of candidates,
where the global tree and its associated count matrices do not fit in the local memory
of a process, DPF has to partition the tree and perform multiple scans of the local
database at each iteration. This added I/O overhead makes this formulation imprac-
tical in the Big Data context. The authors also showed that DPF's parallel efficiency
will decrease as the number of processes increases, even when the amount of work in-
creases, due to the overhead of maintaining the global tree at each process. To address
these limitations, they proposed Static and Dynamic Task-Parallel Formulations
(STPF and DTPF), which partition both the input data and the lexicographic tree
during computation.
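The core of one DPF level can be sketched as follows. This is a minimal illustration
of the count-distribution pattern rather than the authors' implementation: it assumes
mpi4py and NumPy, a caller-supplied count_local_supports routine and candidate list,
and treats the support threshold as an absolute count.

from mpi4py import MPI
import numpy as np

def dpf_level(comm, local_sequences, candidates, count_local_supports, min_count):
    """One level of a count-distribution style round (illustrative sketch)."""
    # Each process scans only its own partition of the sequence database,
    # producing one local support count per candidate at this level.
    local_counts = np.array(
        [count_local_supports(local_sequences, c) for c in candidates],
        dtype=np.int64)

    # Reduction: the global support of each candidate is the sum of its
    # local counts across all processes.
    global_counts = np.empty_like(local_counts)
    comm.Allreduce(local_counts, global_counts, op=MPI.SUM)

    # All processes see the same global counts, so each keeps an identical
    # frequent set and the replicated lexicographic tree stays in sync.
    return [c for c, s in zip(candidates, global_counts) if s >= min_count]

Because the tree and its count matrices are replicated on every process, the per-process
memory footprint grows with the number of candidates, which is exactly the limitation
that forces the multi-scan variant described above.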
Both task-parallel formulations first use DPF to expand the tree up to a level
k + 1, k > 0. Further expanding a (k + 1)-level node requires count data from its
parent node. Therefore, nodes at level k are partitioned among the processes for
further processing, along with their projected databases. A node's projected database
contains a subset of the sequence database D such that each item in the itemsets of
each transaction is still viable to extend the sequence represented by the node into
a possible frequent sequence. These viable items are called active items. The question
remains how to partition the k-level nodes among the processes. The authors pro-
posed two static approaches. The first uses a bin-packing algorithm based on relative
computation time estimates for expanding the nodes, computed as the sum of their
children's corresponding sequential patterns' support. The second approach aims to
minimize the overlap in the nodes' projected databases via repeated minimum-cut
partitioning of the bipartite graph formed by the set of nodes and the set of itemsets
at level (k − 1) in the tree. The authors found that the bipartite graph partitioning
approach substantially reduces the overlap in projected databases and leads to
smaller execution times than other static task-parallel formulations.
However, its workload prediction accuracy decreases as the number of processes
increases, leading to unbalanced loads. As a result, the authors introduced dynamic
load balancing in DTPF that monitors process workloads and reassigns work as
necessary.
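The first static approach can be illustrated with a greedy bin-packing sketch that
always hands the next most expensive level-k node to the least-loaded process. The
node objects, the children/support attributes, and the cost estimate below are
assumptions made for the example, not the authors' data structures.

import heapq

def estimate_cost(node):
    # Relative computation time estimate for expanding a level-k node:
    # the sum of the supports of its children's sequential patterns.
    return sum(child.support for child in node.children)

def bin_pack_nodes(level_k_nodes, num_processes):
    """Greedily assign level-k nodes (and their projected databases) to processes."""
    nodes = sorted(level_k_nodes, key=estimate_cost, reverse=True)
    # Min-heap of (current estimated load, process rank, assigned nodes).
    bins = [(0, rank, []) for rank in range(num_processes)]
    heapq.heapify(bins)
    for node in nodes:
        load, rank, assigned = heapq.heappop(bins)
        assigned.append(node)
        heapq.heappush(bins, (load + estimate_cost(node), rank, assigned))
    return {rank: assigned for _, rank, assigned in bins}

A heuristic of this kind balances estimated work but says nothing about how much of
the database each process must hold; reducing that replication is what the bipartite
graph partitioning approach targets.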
Cong et al. took a sampling approach to accomplish static task partitioning in Par-
ASP [9], which they named selective sampling. After gathering 1-length frequent
sequence counts, they separate a sample S_D of the k-length frequent prefixes of
sequences in D. The size of the prefix k is a function of the average length of
sequences in D. The authors then use one process to mine the sample via a pattern
growth algorithm, recording execution times for the found frequent sub-sequences.
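The sampling step itself is simple to sketch. The snippet below is an illustrative
simplification that treats each sequence as a flat list of items (ignoring itemset
structure) and times the mining of each sampled prefix's projected database; the
helper names and the per-prefix timing scheme are assumptions, not the Par-ASP code.

import time
from collections import Counter

def selective_sample(sequences, k, min_support):
    """Return the k-length prefixes that are frequent in the database."""
    prefix_counts = Counter(tuple(seq[:k]) for seq in sequences if len(seq) >= k)
    return [p for p, count in prefix_counts.items() if count >= min_support]

def time_sample_mining(prefixes, sequences, mine_projected, min_support):
    """Mine each sampled prefix on one process and record how long it took;
    the recorded times serve as estimates of each task's relative weight."""
    timings = {}
    for prefix in prefixes:
        # Projected database: suffixes of the sequences that start with the prefix.
        projected = [seq[len(prefix):] for seq in sequences
                     if tuple(seq[:len(prefix)]) == prefix]
        start = time.perf_counter()
        mine_projected(prefix, projected, min_support)  # assumed pattern-growth miner
        timings[prefix] = time.perf_counter() - start
    return timings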