load balancing and found that there was not a significant difference relative to the
rest of the computation time.
Di Fatta and Berthold [14] created a distributed implementation of MoFa using
message passing. Dynamic load balancing is achieved through receiver-initiated work
requests. When issuing a work request, a process must choose another worker to
poll. The authors present a policy called random-ranked polling, in which each
process maintains a list of all other processes, sorted by the time at which their
current work unit was started. The requesting process selects randomly from this list,
with a bias toward processes that have spent more time on their current work unit. A
process asked for work first heuristically determines whether it should split its work,
based on its stack size and on the support and branching factor of the pattern
currently being tested. If it can provide work, the donor process sends one pattern
from the bottom of its work stack. The requesting process is then responsible for
re-generating the embedding list of the received pattern before processing it. A
sketch of the victim-selection step of this policy is given below.
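The exact bias function used in random-ranked polling is not reproduced here; the
following minimal sketch assumes a simple linear weighting in which processes that
have worked longer on their current unit are proportionally more likely to be polled.
The function name random_ranked_poll and the start_times structure are illustrative,
not part of the original algorithm description.

```python
import random

def random_ranked_poll(start_times, my_rank):
    """Choose a process to send a work request to (random-ranked polling sketch).

    start_times: dict mapping process rank -> time at which that process
    started its current work unit (an earlier time means it has been
    working on that unit longer).
    Returns the rank of the process to poll.
    """
    # Order candidates so the longest-running processes come first.
    candidates = sorted((r for r in start_times if r != my_rank),
                        key=lambda r: start_times[r])
    # Assumed bias: linearly decreasing weights, so processes that have
    # spent more time on their current work unit are polled more often,
    # while every process keeps a nonzero chance of being selected.
    weights = [len(candidates) - i for i in range(len(candidates))]
    return random.choices(candidates, weights=weights, k=1)[0]
```

A donor process that agrees to share work would then pop a pattern from the bottom
of its stack and send it, leaving the requester to re-generate the pattern's
embedding list locally before expanding it.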
7 Conclusion
Many efficient serial algorithms have been developed for solving the frequent pattern
mining problem. Yet they often do not scale to the type of data we are presented with
today, the so-called “Big Data”. In this chapter, we gave an overview of parallel ap-
proaches for solving the problem, looking both at the initially defined frequent itemset
mining problem and at its extension to the sequence and graph mining domains. We
identified three areas as key challenges to parallel algorithmic design in the context of
frequent pattern mining: memory scalability, work partitioning, and load balancing.
With these challenges as a frame of reference, we extracted key algorithmic design
patterns from the wealth of research conducted in this domain. We found that, among
parallel candidate generation based algorithms, memory scalability is often the most
difficult obstacle to overcome, while for those parallel algorithms based on pattern
growth methods, load balance is typically the most critical consideration for efficient
parallel execution.
The parallel pattern mining problem is in no way “solved”. Many of the methods
presented here are more than a decade old and were designed for parallel architectures
very different from those that exist today. Moreover, they were not evaluated on
datasets big enough to show scalability to Big Data levels. While most works included
limited scalability studies, they generally did not compare their results against other
existing parallel algorithms for the same problem, even those designed for the same
architecture. More research is needed to validate existing methods at the Big Data
scale.
Work partitioning and load balancing continue to be open problems for parallel
frequent pattern mining. Better methods to estimate the cost of solving sub-problems
at each process can lessen the need for dynamic load balancing and improve overall
efficiency. Additionally, they can help processes intelligently decide whether to split
their work when responding to work requests.