database, is then used to mine patterns in the lattice region. For example, mining patterns in the suffix-based equivalence class of {b} only requires data from tids 1, 2, 3, 4, 8, and 9, which contain {b} or {ab} as prefixes.
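To make the idea concrete, a minimal sketch is given below, with hypothetical tids and items rather than those of the running example: to mine itemsets in the equivalence class of item b, it suffices to keep only the transactions that contain b, truncated at b, since every itemset in that class is built from such prefixes.

```python
# Minimal sketch (hypothetical data): build the projected database needed to
# mine the equivalence class of itemsets ending in item b.
transactions = {          # tid -> items, sorted lexicographically (illustrative)
    1: ["a", "b", "c"],
    2: ["a", "b"],
    3: ["b", "d"],
    5: ["a", "c"],        # does not contain b, so never touched for this class
}

def project_for_class(db, class_item):
    """Keep only transactions containing class_item, truncated at that item."""
    projected = {}
    for tid, items in db.items():
        if class_item in items:
            # Items up to and including class_item form the usable prefix.
            projected[tid] = items[: items.index(class_item) + 1]
    return projected

print(project_for_class(transactions, "b"))
# {1: ['a', 'b'], 2: ['a', 'b'], 3: ['b']}
```

In a parallel setting, only this projected data needs to be available to the process responsible for that lattice region, which is what allows a process to work with a fraction of the full database.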
3 Paradigms for Big Data Computation
The challenges of working with Big Data are two-fold. First, dataset sizes have increased much faster than the available memory of a workstation. The second challenge is the computation time required to find a solution. Computational parallelism is an essential tool for managing the massive scale of today's data. It not only allows one to operate on more data than can fit on a single machine, but also offers speedup opportunities for computationally intensive applications. In the remainder of this section we briefly discuss the principles of parallel algorithm design, outline some challenges specific to the frequent pattern mining problem, and then detail general approaches for addressing these challenges in shared and distributed memory systems.
3.1 Principles of Parallel Algorithms
Designing a parallel algorithm is not an easy prospect. In addition to all of the challenges associated with serial algorithm design, there are a host of issues specific to parallel computation that must be considered. We will briefly discuss the topics of memory scalability, work partitioning, and load balancing. For a more comprehensive look at parallel algorithm design, we refer the reader to Grama et al. [17].
As one might imagine, extending the serial FIM methods to parallel systems need not be difficult. For example, a serial candidate-generation-based algorithm can be made parallel by replicating the list of transactions at each process and having each process compute the support for a subset of candidate itemsets in a globally accessible hash tree. These "direct" extensions, however, rely on assumptions such as unlimited process memory and a concurrent-read/concurrent-write architecture, which ignore the challenges outlined at the outset of this section.
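A minimal sketch of such a "direct" extension is shown below, assuming Python's multiprocessing module; the transactions and candidate itemsets are hypothetical, and a plain dictionary stands in for the globally accessible hash tree. Every worker holds a full copy of the transactions and counts support only for its assigned slice of the candidates.

```python
# Sketch of a "direct" parallel extension of candidate-based counting:
# transactions are replicated at every process, candidates are partitioned.
from multiprocessing import Pool

transactions = [{"a", "b", "c"}, {"a", "b"}, {"b", "c"}, {"a", "c", "d"}]
candidates = [frozenset(c) for c in ({"a", "b"}, {"a", "c"}, {"b", "c"}, {"c", "d"})]

def count_supports(cand_slice):
    # Each process scans the full (replicated) transaction list, but only for
    # the candidate itemsets assigned to it.
    return {c: sum(c <= t for t in transactions) for c in cand_slice}

if __name__ == "__main__":
    n_procs = 2
    # Deal the candidates out round-robin so every process gets a disjoint subset.
    slices = [candidates[i::n_procs] for i in range(n_procs)]
    with Pool(n_procs) as pool:
        partial_counts = pool.map(count_supports, slices)
    supports = {c: s for d in partial_counts for c, s in d.items()}
    print(supports)
```

Because each process stores the entire transaction list, per-process memory does not shrink as more processes are added, which is precisely the memory-scalability issue discussed next.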
One of the key factors in choosing to use parallel algorithms in lieu of their serial counterparts is data that is too large to fit in memory on a single workstation. Even when the input dataset fits in memory, intermediate data such as candidate patterns and their counts, or data structures used during frequent pattern mining, may not. Memory scalability is essential when working with Big Data as it allows an application to cope with very large datasets by increasing parallelism.
We call an algorithm memory scalable if the required memory per process is a function of Θ(n/p) + O(p), where n is the size of the input data and p is the number of processes executed in parallel. As the number of processes grows, the required amount of memory per process for a memory scalable algorithm decreases. A challenge in