In ParEclat, Zaki et al. [ 67 ] employed a greedy algorithm that distributes work
evenly among processes. First, each equivalence class is given a weight based on
its cardinality. This weight is then used to distribute classes so that all processes are
assigned subsets of classes with even weight.
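As an illustration, a minimal sketch of this kind of greedy, weight-based assignment might look as follows; the class identifiers, weights, and process count are illustrative assumptions, not ParEclat's actual implementation.

```python
import heapq

def assign_classes(class_weights, num_processes):
    """Greedily assign equivalence classes to processes so that the total
    weight per process stays roughly even (heaviest classes placed first).

    class_weights: dict mapping an equivalence-class id to its weight
                   (e.g., its cardinality), as described for ParEclat.
    Returns: one set of class ids per process.
    """
    # Min-heap of (current total weight, process id); the lightest
    # process always receives the next class.
    heap = [(0, p) for p in range(num_processes)]
    heapq.heapify(heap)
    assignment = [set() for _ in range(num_processes)]

    # Place heavier classes first so later, lighter classes can smooth
    # out any remaining imbalance.
    for cls, weight in sorted(class_weights.items(),
                              key=lambda kv: kv[1], reverse=True):
        total, proc = heapq.heappop(heap)
        assignment[proc].add(cls)
        heapq.heappush(heap, (total + weight, proc))
    return assignment

# Example with made-up class weights (equivalence-class cardinalities).
print(assign_classes({"A": 10, "B": 7, "C": 7, "D": 3, "E": 2}, num_processes=2))
```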
A more advanced work distribution scheme for pattern growth methods was pro-
posed by Cong et al. in [ 9 ], for their Par-FP algorithm. The authors suggested the
use of selective sampling as a preprocessing step to estimate the cost for FP-Growth
execution rooted at a particular item. In selective sampling, the t most frequent items,
where t is an application parameter, are discarded from all transactions, along with
any infrequent items. An FP-Tree is built from the resulting set of items and mined
by a single process, recording the execution time of mining the projected database
of each sampled item. After the sampling step, the work of mining each item i ∈ I is
partitioned in the following way. First, of the sampled items, any identified as large
are further split into subtasks which are assigned to processes in a round-robin fash-
ion. A large item is defined as any item whose sample execution time is significantly
higher than the expected execution time given an equal distribution of work. Then,
all other sampled items are assigned using a binning algorithm based on their sample
execution time. Finally, those frequent items which were not sampled are assigned
to processes in a round-robin fashion.
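A hedged sketch of this three-tier assignment is given below; the sample times, the cutoff used to identify large items, the number of subtasks per large item, and the greedy stand-in for the binning step are simplifying assumptions rather than Cong et al.'s exact procedure.

```python
def partition_parfp_work(sample_times, unsampled_items, num_processes,
                         large_factor=2.0, subtasks_per_large=4):
    """Assign mining tasks to processes following the three rules described
    above. sample_times: {item: measured sample execution time}.
    unsampled_items: frequent items that were not sampled.
    large_factor and subtasks_per_large are illustrative parameters."""
    tasks = [[] for _ in range(num_processes)]
    loads = [0.0] * num_processes

    # Expected per-process cost under an even split of the sampled work.
    expected = sum(sample_times.values()) / num_processes
    large = {i: t for i, t in sample_times.items() if t > large_factor * expected}
    regular = {i: t for i, t in sample_times.items() if i not in large}

    # 1. Large items: split into subtasks, assigned round-robin.
    rr = 0
    for item, t in large.items():
        for sub in range(subtasks_per_large):
            tasks[rr % num_processes].append((item, f"subtask-{sub}"))
            loads[rr % num_processes] += t / subtasks_per_large
            rr += 1

    # 2. Remaining sampled items: binned by sample execution time
    #    (greedy placement into the currently least-loaded process).
    for item, t in sorted(regular.items(), key=lambda kv: kv[1], reverse=True):
        p = min(range(num_processes), key=lambda i: loads[i])
        tasks[p].append((item, "whole"))
        loads[p] += t

    # 3. Unsampled frequent items: plain round-robin.
    for k, item in enumerate(unsampled_items):
        tasks[k % num_processes].append((item, "whole"))
    return tasks

# Example with made-up sample times (seconds) and four processes.
times = {"a": 10.0, "b": 2.0, "c": 2.0, "d": 2.0}
print(partition_parfp_work(times, unsampled_items=["e", "f"], num_processes=4))
```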
4.3 Dynamic Load Balancing
The cost of executing FIM work can be estimated reasonably accurately on homogeneous
systems through work partitioning performed as a preprocessing step before algorithm
execution. MapReduce systems, however, are often heterogeneous, composed of nodes
with diverse compute power or resource availability, which limits the effectiveness of
such static partitioning and motivates dynamic load balancing. Lin et al. [ 33 ] noted
that, in their SPC algorithm, the number of candidate itemsets is typically small during
later iterations of Apriori, leading to increased overhead due to the startup and
scheduling of MapReduce jobs. Therefore, the authors
proposed two dynamic load balancing improvements to SPC, the FPC and DPC algo-
rithms. FPC combines a fixed number of Apriori iterations into a single pass, which
reduces MapReduce overhead and the number of scans over the list of transactions,
but can also lead to many false positive results. The DPC algorithm dynamically
chooses how many iterations to combine during each MAP phase, via two heuristics.
First, in a single MAP phase, the DPC algorithm will combine as many iterations as
it can without exceeding a specified maximum number of candidate itemsets. Since
MapReduce systems can be heterogeneous, it is possible that the appropriate thresh-
old for one compute node leads to a significantly longer or shorter execution time
on a different node for the same iteration. The second heuristic of DPC dynamically
adjusts the threshold value based on the execution times from previous iterations. In
this way, the DPC algorithm can accommodate systems with dynamically changing
resource availability.
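The two heuristics could be sketched roughly as follows; the candidate-count estimates, the target pass time, and the particular threshold-update rule are assumptions made for illustration and do not reproduce the published DPC logic.

```python
def plan_combined_pass(candidate_counts_by_level, max_candidates):
    """First heuristic: combine as many consecutive Apriori iterations into one
    MAP phase as possible without exceeding max_candidates total candidates.
    candidate_counts_by_level: (level, estimated candidate count) pairs for
    the upcoming iterations."""
    combined, total = [], 0
    for level, count in candidate_counts_by_level:
        if total + count > max_candidates and combined:
            break
        combined.append(level)
        total += count
    return combined

def adjust_threshold(max_candidates, last_elapsed, target_elapsed, alpha=0.5):
    """Second heuristic: scale the candidate threshold toward a target pass
    time using the execution time observed in the previous iteration.
    alpha damps the adjustment; both it and target_elapsed are illustrative."""
    ratio = target_elapsed / max(last_elapsed, 1e-9)
    return max(1, int(max_candidates * (1 + alpha * (ratio - 1))))

# Example: combine levels under a 100,000-candidate cap, then relax the cap
# because the last pass finished faster than the 60 s target (made-up numbers).
levels = [(3, 40_000), (4, 35_000), (5, 50_000)]
print(plan_combined_pass(levels, max_candidates=100_000))   # -> [3, 4]
print(adjust_threshold(100_000, last_elapsed=45.0, target_elapsed=60.0))
```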