In ParEclat, Zaki et al. [ 67 ] employed a greedy algorithm that distributes work
evenly among processes. First, each equivalence class is given a weight based on
its cardinality. This weight is then used to distribute classes so that all processes are
assigned subsets of classes with even weight.
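As an illustration, a minimal sketch of this kind of greedy, weight-based assignment might look as follows; the class identifiers, weights, and process count are illustrative assumptions, not ParEclat's actual implementation.

```python
import heapq

def assign_classes(class_weights, num_processes):
    """Greedily assign equivalence classes to processes so that the total
    weight per process stays roughly even (heaviest classes placed first).

    class_weights: dict mapping an equivalence-class id to its weight
                   (e.g., its cardinality), as described for ParEclat.
    Returns: one set of class ids per process.
    """
    # Min-heap of (current total weight, process id); the lightest
    # process always receives the next class.
    heap = [(0, p) for p in range(num_processes)]
    heapq.heapify(heap)
    assignment = [set() for _ in range(num_processes)]

    # Place heavier classes first so later, lighter classes can smooth
    # out any remaining imbalance.
    for cls, weight in sorted(class_weights.items(),
                              key=lambda kv: kv[1], reverse=True):
        total, proc = heapq.heappop(heap)
        assignment[proc].add(cls)
        heapq.heappush(heap, (total + weight, proc))
    return assignment

# Example with made-up class weights (equivalence-class cardinalities).
print(assign_classes({"A": 10, "B": 7, "C": 7, "D": 3, "E": 2}, num_processes=2))
```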
A more advanced work distribution scheme for pattern growth methods was pro-
posed by Cong et al. in [ 9 ], for their Par-FP algorithm. The authors suggested the
use of selective sampling as a preprocessing step to estimate the cost for FP-Growth
execution rooted at a particular item. In selective sampling, the t most frequent items,
where t is an application parameter, are discarded from all transactions, along with
any infrequent items. An FP-Tree is built from the resulting set of items and mined
by a single process, recording the execution time of mining the projected database
of each sampled item. After the sampling step, the work of mining each item i ∈ I is
partitioned in the following way. First, of the sampled items, any identified as large
are further split into subtasks which are assigned to processes in a round-robin fash-
ion. A large item is defined as any item whose sample execution time is significantly
higher than the expected execution time given an equal distribution of work. Then,
all other sampled items are assigned using a binning algorithm based on their sample
execution time. Finally, those frequent items which were not sampled are assigned
to processes in a round-robin fashion.
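A hedged sketch of this three-tier assignment is given below; the sample times, the cutoff used to identify large items, the number of subtasks per large item, and the greedy stand-in for the binning step are simplifying assumptions rather than Cong et al.'s exact procedure.

```python
def partition_parfp_work(sample_times, unsampled_items, num_processes,
                         large_factor=2.0, subtasks_per_large=4):
    """Assign mining tasks to processes following the three rules described
    above. sample_times: {item: measured sample execution time}.
    unsampled_items: frequent items that were not sampled.
    large_factor and subtasks_per_large are illustrative parameters."""
    tasks = [[] for _ in range(num_processes)]
    loads = [0.0] * num_processes

    # Expected per-process cost under an even split of the sampled work.
    expected = sum(sample_times.values()) / num_processes
    large = {i: t for i, t in sample_times.items() if t > large_factor * expected}
    regular = {i: t for i, t in sample_times.items() if i not in large}

    # 1. Large items: split into subtasks, assigned round-robin.
    rr = 0
    for item, t in large.items():
        for sub in range(subtasks_per_large):
            tasks[rr % num_processes].append((item, f"subtask-{sub}"))
            loads[rr % num_processes] += t / subtasks_per_large
            rr += 1

    # 2. Remaining sampled items: binned by sample execution time
    #    (greedy placement into the currently least-loaded process).
    for item, t in sorted(regular.items(), key=lambda kv: kv[1], reverse=True):
        p = min(range(num_processes), key=lambda i: loads[i])
        tasks[p].append((item, "whole"))
        loads[p] += t

    # 3. Unsampled frequent items: plain round-robin.
    for k, item in enumerate(unsampled_items):
        tasks[k % num_processes].append((item, "whole"))
    return tasks

# Example with made-up sample times (seconds) and four processes.
times = {"a": 10.0, "b": 2.0, "c": 2.0, "d": 2.0}
print(partition_parfp_work(times, unsampled_items=["e", "f"], num_processes=4))
```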
4.3 Dynamic Load Balancing
The cost of executing FIM work can be estimated reasonably accurately on homogeneous
systems through work partitioning performed as a preprocessing step before algorithm
execution. MapReduce systems, however, are often heterogeneous, composed of nodes
with diverse compute power or resource availability, which limits the effectiveness of
such static partitioning and motivates dynamic load balancing. Lin et al. [ 33 ] noted
that, in their SPC algorithm, the number of candidate itemsets is typically small during
later iterations of Apriori, leading to increased overhead due to the startup and
scheduling of MapReduce jobs. Therefore, the authors
proposed two dynamic load balancing improvements to SPC, the FPC and DPC algo-
rithms. FPC combines a fixed number of Apriori iterations into a single pass, which
reduces MapReduce overhead and the number of scans over the list of transactions,
but can also lead to many false positive results. The DPC algorithm dynamically
chooses how many iterations to combine during each MAP phase, via two heuristics.
First, in a single MAP phase, the DPC algorithm will combine as many iterations as
it can without exceeding a specified maximum number of candidate itemsets. Since
MapReduce systems can be heterogeneous, it is possible that the appropriate thresh-
old for one compute node leads to a significantly longer or shorter execution time
on a different node for the same iteration. The second heuristic of DPC dynamically
adjusts the threshold value based on the execution times from previous iterations. In
this way, the DPC algorithm can accommodate systems with dynamically changing
resource availability.
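The two heuristics could be sketched roughly as follows; the candidate-count estimates, the target pass time, and the particular threshold-update rule are assumptions made for illustration and do not reproduce the published DPC logic.

```python
def plan_combined_pass(candidate_counts_by_level, max_candidates):
    """First heuristic: combine as many consecutive Apriori iterations into one
    MAP phase as possible without exceeding max_candidates total candidates.
    candidate_counts_by_level: (level, estimated candidate count) pairs for
    the upcoming iterations."""
    combined, total = [], 0
    for level, count in candidate_counts_by_level:
        if total + count > max_candidates and combined:
            break
        combined.append(level)
        total += count
    return combined

def adjust_threshold(max_candidates, last_elapsed, target_elapsed, alpha=0.5):
    """Second heuristic: scale the candidate threshold toward a target pass
    time using the execution time observed in the previous iteration.
    alpha damps the adjustment; both it and target_elapsed are illustrative."""
    ratio = target_elapsed / max(last_elapsed, 1e-9)
    return max(1, int(max_candidates * (1 + alpha * (ratio - 1))))

# Example: combine levels under a 100,000-candidate cap, then relax the cap
# because the last pass finished faster than the 60 s target (made-up numbers).
levels = [(3, 40_000), (4, 35_000), (5, 50_000)]
print(plan_combined_pass(levels, max_candidates=100_000))   # -> [3, 4]
print(adjust_threshold(100_000, last_elapsed=45.0, target_elapsed=60.0))
```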