Big Data Frequent Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

Mining each frequent sub-sequence becomes a task. Projected databases for these

frequent sequences are computed during the mining time estimation. Task distribution

is then done in the same way as in the Par-FP algorithm, as described in Sect. 4 . The

authors found that the serial sampling component of their algorithm accounted on

average for 1.1 % of the serial mining time, which limits the speedup potential as the

number of processes increases.

Unlike in the general sequence mining problem, infrequent items cannot be re-

moved from projected sequences in the gap-constrained frequency mining problem.

Miliaraki et al. proposed several ways to compress projected databases in MG-FSM,

based on the concept of w-equivalency , i.e., the equivalency of projected databases

with respect to a pivot w . Projected databases are constructed and compressed for all

1-length frequent sequences in

in the MAP phase of a MapReduce job. Run-length

and variable-byte encoding are used to reduce the size of the projected databases be-

fore transmitting them to reducers, which provides a significant performance boost,

as reported by the authors in their experiments. A serial algorithm is used in the RE-

DUCE phase of the MapReduce job to mine the independent partitions of the sequence

database.

D

5.2.3

Dynamic Load Balancing

The advantage of separating work into partitions that processes can accomplish in-

dependently seems clear. Yet it is not always clear how long each partition will take

to mine. Dynamic load balancing aims to (re)-distribute the work as necessary when

processes finish their assigned tasks, in such a way that processes are allotted equal

amounts of work overall.

Zaki extended his static task partitioning scheme (SLB) in pSPADE by forming

a task queue. First level class nodes are entered into the queue in the same way as

they would have been assigned in SLB. In the inter-Class Dynamic Load Balancing

(CDLB) scheme, processes pick up new tasks from the queue, one at a time, as soon

as they finish their current work. An abnormally large sub-class may still lead to load

imbalance. The Recursive Dynamic Load Balancing (RDLB) scheme exploits both

inter and intra-class parallelism, allowing a process to share its work with free ones.

Mining of a class sub-tree at the process level takes place in a breath-first manner,

level by level. A process signals it is free via a global counter when it finishes its

current work and the queue is empty. Another process that still has work then enters

its unprocessed nodes on the level he is currently processing into the global queue,

providing work for the first process. The author reported RDLB performs best among

the three proposed schemes.

The static task-parallel formulation (STPF) presented by Guralnik and Karypis

may suffer some load imbalance as the number of processes increases. The authors

combat this through a receiver initiated load-balancing with random polling scheme

[ 17 ]. Processes are initially assigned tasks according to the STPF scheme. As a

process completes its currently assigned work, it sends a request for more work to

a randomly chosen other donor process. The donor process responds positively, by

Frequent Pattern Mining

Search WWH ::

Custom Search

Home