Database Reference
In-Depth Information
Mining each frequent sub-sequence becomes a task. Projected databases for these
frequent sequences are computed during the mining time estimation. Task distribution
is then done in the same way as in the Par-FP algorithm, as described in Sect. 4 . The
authors found that the serial sampling component of their algorithm accounted on
average for 1.1 % of the serial mining time, which limits the speedup potential as the
number of processes increases.
Unlike in the general sequence mining problem, infrequent items cannot be re-
moved from projected sequences in the gap-constrained frequency mining problem.
Miliaraki et al. proposed several ways to compress projected databases in MG-FSM,
based on the concept of w-equivalency , i.e., the equivalency of projected databases
with respect to a pivot w . Projected databases are constructed and compressed for all
1-length frequent sequences in
in the MAP phase of a MapReduce job. Run-length
and variable-byte encoding are used to reduce the size of the projected databases be-
fore transmitting them to reducers, which provides a significant performance boost,
as reported by the authors in their experiments. A serial algorithm is used in the RE-
DUCE phase of the MapReduce job to mine the independent partitions of the sequence
database.
D
5.2.3
Dynamic Load Balancing
The advantage of separating work into partitions that processes can accomplish in-
dependently seems clear. Yet it is not always clear how long each partition will take
to mine. Dynamic load balancing aims to (re)-distribute the work as necessary when
processes finish their assigned tasks, in such a way that processes are allotted equal
amounts of work overall.
Zaki extended his static task partitioning scheme (SLB) in pSPADE by forming
a task queue. First level class nodes are entered into the queue in the same way as
they would have been assigned in SLB. In the inter-Class Dynamic Load Balancing
(CDLB) scheme, processes pick up new tasks from the queue, one at a time, as soon
as they finish their current work. An abnormally large sub-class may still lead to load
imbalance. The Recursive Dynamic Load Balancing (RDLB) scheme exploits both
inter and intra-class parallelism, allowing a process to share its work with free ones.
Mining of a class sub-tree at the process level takes place in a breath-first manner,
level by level. A process signals it is free via a global counter when it finishes its
current work and the queue is empty. Another process that still has work then enters
its unprocessed nodes on the level he is currently processing into the global queue,
providing work for the first process. The author reported RDLB performs best among
the three proposed schemes.
The static task-parallel formulation (STPF) presented by Guralnik and Karypis
may suffer some load imbalance as the number of processes increases. The authors
combat this through a receiver initiated load-balancing with random polling scheme
[ 17 ]. Processes are initially assigned tasks according to the STPF scheme. As a
process completes its currently assigned work, it sends a request for more work to
a randomly chosen other donor process. The donor process responds positively, by
Search WWH ::




Custom Search