load balancing and found that there was not a significant difference relative to the
rest of the computation time.
Di Fatta and Berthold [14] created a distributed implementation of MoFa using
message passing. Dynamic load balancing is achieved through receiver-initiated work
requests. When issuing a work request, a process must choose another worker to
poll. The authors present a policy called random-ranked polling, in which each
process maintains a list of all other processes, sorted by the time at which their
current work unit was started. The requesting process selects randomly from this list,
with a bias toward processes that have spent more time on their current work unit. A
process asked for work first heuristically determines whether it should split its work,
based on its stack size and on the support and branching factor of the pattern
currently being tested. If it can provide work, the donor process sends one pattern
from the bottom of its work stack. The requesting process is then responsible for
re-generating the embedding list of the received pattern before processing it. A
sketch of the victim-selection step of this policy is given below.
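The exact bias function used in random-ranked polling is not reproduced here; the
following minimal sketch assumes a simple linear weighting in which processes that
have worked longer on their current unit are proportionally more likely to be polled.
The function name random_ranked_poll and the start_times structure are illustrative,
not part of the original algorithm description.

```python
import random

def random_ranked_poll(start_times, my_rank):
    """Choose a process to send a work request to (random-ranked polling sketch).

    start_times: dict mapping process rank -> time at which that process
    started its current work unit (an earlier time means it has been
    working on that unit longer).
    Returns the rank of the process to poll.
    """
    # Order candidates so the longest-running processes come first.
    candidates = sorted((r for r in start_times if r != my_rank),
                        key=lambda r: start_times[r])
    # Assumed bias: linearly decreasing weights, so processes that have
    # spent more time on their current work unit are polled more often,
    # while every process keeps a nonzero chance of being selected.
    weights = [len(candidates) - i for i in range(len(candidates))]
    return random.choices(candidates, weights=weights, k=1)[0]
```

A donor process that agrees to share work would then pop a pattern from the bottom
of its stack and send it, leaving the requester to re-generate the pattern's
embedding list locally before expanding it.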
7 Conclusion
Many efficient serial algorithms have been developed for solving the frequent pattern
mining problem. Yet they often do not scale to the type of data we are presented with
today, the so-called “Big Data”. In this chapter, we gave an overview of parallel ap-
proaches for solving the problem, looking both at the initially defined frequent itemset
mining problem and at its extension to the sequence and graph mining domains. We
identified three areas as key challenges to parallel algorithmic design in the context of
frequent pattern mining: memory scalability, work partitioning, and load balancing.
With these challenges as a frame of reference, we extracted key algorithmic design
patterns from the wealth of research conducted in this domain. We found that, among
parallel candidate generation based algorithms, memory scalability is often the most
difficult obstacle to overcome, while for those parallel algorithms based on pattern
growth methods, load balance is typically the most critical consideration for efficient
parallel execution.
The parallel pattern mining problem is in no way “solved”. Many of the methods
presented here are more than a decade old and were designed for parallel architectures
very different from those that exist today. Moreover, they were not evaluated on
datasets big enough to show scalability to Big Data levels. While most works included
limited scalability studies, they generally did not compare their results against other
existing parallel algorithms for the same problem, even those designed for the same
architecture. More research is needed to validate existing methods at the Big Data
scale.
Work partitioning and load balancing continue to be open problems for parallel
frequent pattern mining. Better methods to estimate the cost of solving sub-problems
at each process can lessen the need for dynamic load balancing and improve overall
efficiency. Additionally, they can help processes intelligently decide whether to split
their work when responding to work requests.