database, is then used to mine patterns in the lattice region. For example, mining patterns in the suffix-based equivalence class of {b} only requires data from tids 1, 2, 3, 4, 8, and 9, which contain {b} or {ab} as prefixes.
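To make the idea concrete, a minimal sketch is given below, with hypothetical tids and items rather than those of the running example: to mine itemsets in the equivalence class of item b, it suffices to keep only the transactions that contain b, truncated at b, since every itemset in that class is built from such prefixes.

```python
# Minimal sketch (hypothetical data): build the projected database needed to
# mine the equivalence class of itemsets ending in item b.
transactions = {          # tid -> items, sorted lexicographically (illustrative)
    1: ["a", "b", "c"],
    2: ["a", "b"],
    3: ["b", "d"],
    5: ["a", "c"],        # does not contain b, so never touched for this class
}

def project_for_class(db, class_item):
    """Keep only transactions containing class_item, truncated at that item."""
    projected = {}
    for tid, items in db.items():
        if class_item in items:
            # Items up to and including class_item form the usable prefix.
            projected[tid] = items[: items.index(class_item) + 1]
    return projected

print(project_for_class(transactions, "b"))
# {1: ['a', 'b'], 2: ['a', 'b'], 3: ['b']}
```

In a parallel setting, only this projected data needs to be available to the process responsible for that lattice region, which is what allows a process to work with a fraction of the full database.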
3 Paradigms for Big Data Computation
The challenges of working with Big Data are two-fold. First, dataset sizes have increased much faster than the available memory of a workstation. The second challenge is the computation time required to find a solution. Computational parallelism is an essential tool for managing the massive scale of today's data. It not only allows one to operate on more data than can fit on a single machine, but also offers speedup opportunities for computationally intensive applications. In the remainder of this section we briefly discuss the principles of parallel algorithm design, outline some challenges specific to the frequent pattern mining problem, and then detail general approaches for addressing these challenges in shared and distributed memory systems.
3.1 Principles of Parallel Algorithms
Designing a parallel algorithm is not an easy prospect. In addition to all of the challenges associated with serial algorithm design, there are a host of issues specific to parallel computation that must be considered. We will briefly discuss the topics of memory scalability, work partitioning, and load balancing. For a more comprehensive look at parallel algorithm design, we refer the reader to Grama et al. [17].
As one might imagine, extending the serial FIM methods to parallel systems need not be difficult. For example, a serial candidate-generation-based algorithm can be made parallel by replicating the list of transactions at each process and having each process compute the support for a subset of candidate itemsets in a globally accessible hash tree. These "direct" extensions, however, rely on assumptions such as unlimited process memory and a concurrent-read/concurrent-write architecture, which ignore the challenges outlined at the outset of this section.
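A minimal sketch of such a "direct" extension is shown below, assuming Python's multiprocessing module; the transactions and candidate itemsets are hypothetical, and a plain dictionary stands in for the globally accessible hash tree. Every worker holds a full copy of the transactions and counts support only for its assigned slice of the candidates.

```python
# Sketch of a "direct" parallel extension of candidate-based counting:
# transactions are replicated at every process, candidates are partitioned.
from multiprocessing import Pool

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}]
candidates = [frozenset(c) for c in ({"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"})]

def count_supports(cand_slice):
    # Each process scans the full (replicated) transaction list, but only for
    # the candidate itemsets assigned to it.
    return {c: sum(c <= t for t in transactions) for c in cand_slice}

if __name__ == "__main__":
    n_procs = 2
    # Deal the candidates out round-robin so every process gets a disjoint subset.
    slices = [candidates[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as pool:
        partial_counts = pool.map(count_supports, slices)
    supports = {c: s for d in partial_counts for c, s in d.items()}
    print(supports)
```

Because each process stores the entire transaction list, per-process memory does not shrink as more processes are added, which is precisely the memory-scalability issue discussed next.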
One of the key factors in choosing to use parallel algorithms in lieu of their serial counterparts is data that is too large to fit in memory on a single workstation. Even when the input dataset fits in memory, intermediate data such as candidate patterns and their counts, or data structures used during frequent pattern mining, may not. Memory scalability is essential when working with Big Data as it allows an application to cope with very large datasets by increasing parallelism.
We call an algorithm memory scalable if the required memory per process is a function of Θ(n/p) + O(p), where n is the size of the input data and p is the number of processes executed in parallel. As the number of processes grows, the required amount of memory per process for a memory scalable algorithm decreases. A challenge in