frequent episodes and Srikant and Agrawal's generalized sequential patterns. For an
in-depth survey of serial frequent pattern mining methods, the reader may consult
the work of Han et al. [22].
Motif discovery (finding frequent sub-sequence patterns) is an important problem
in the biological domain, especially as a sub-step in sequence alignment. Rigoutsos
and Floratos [49] proposed a pattern-growth-based method for motif discovery with
rigid gaps (sets of don't care items). Wang et al. [58] use a two-step process for motif
discovery: they first search for short patterns with no gaps, called segments, and
then search for longer patterns made up of segments joined by variable-length gaps.
Liao and Chen [31] use a vertical database format they call the three-dimensional
list to speed up verification of patterns generated via direct spelling, i.e., extending
patterns at each level by each character of the typically small genetic alphabet.
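The direct spelling idea can be sketched as follows. This is a minimal, illustrative sketch only: the function name and the position-list representation are assumptions for exposition, not Liao and Chen's three-dimensional list. Each frequent pattern is extended by every symbol of the small genetic alphabet, and a vertical list of occurrence positions makes the support check for each extension cheap.

```python
# Hypothetical sketch of direct spelling over a small genetic alphabet.
# Patterns grow one symbol per level; each pattern carries a vertical
# list of its start positions in the sequence for fast support checks.

ALPHABET = "ACGT"

def direct_spelling(sequence, min_support, max_len):
    """Return {pattern: start positions} for patterns of length <= max_len
    occurring at least min_support times in sequence."""
    # Level 1: vertical lists of start positions for each single symbol.
    occurrences = {c: [i for i, s in enumerate(sequence) if s == c]
                   for c in ALPHABET}
    frequent = {p: pos for p, pos in occurrences.items()
                if len(pos) >= min_support}
    result = dict(frequent)
    # Extend each surviving pattern by every alphabet symbol.
    for _ in range(max_len - 1):
        next_level = {}
        for pattern, positions in frequent.items():
            for c in ALPHABET:
                ext = [i for i in positions
                       if i + len(pattern) < len(sequence)
                       and sequence[i + len(pattern)] == c]
                if len(ext) >= min_support:
                    next_level[pattern + c] = ext
        frequent = next_level
        result.update(frequent)
    return result
```

Because infrequent patterns are pruned at each level, the small alphabet keeps the number of candidate extensions per pattern bounded by four.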
5.2
Parallel Frequent Sequence Mining
In this section, we discuss a number of proposed parallel sequence mining algorithms,
along the lines of their memory scalability, task partitioning, and load balancing
choices.
5.2.1
Memory Scalability
Some algorithms aimed at distributed systems assume that either the set of input
sequences, the set of candidate sequences, or global count data structures fit in the
local memory of each process. This is an unrealistic expectation when mining Big
Data. Although their focus is on balancing mining tasks, Cong et al. [9, 10] (Par-FP,
Par-ASP, Par-CSP) rely on a sampling technique that requires the entire input set
to be available at each process. Shintani and Kitsuregawa [53] partitioned the input
set in Non Partitioned Sequential Pattern Mining (NPSPM), yet they assumed that
the entire candidate set can be replicated and will fit in the overall memory (random
access memory and hard drive) of a process. Similar assumptions were made by
Joshi et al. [27] in Event Distribution (EVE) and by Guralnik and Karypis [18] in
their Data Parallel Formulation (DPF).
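The memory bottleneck can be seen in a minimal, purely illustrative sketch of such a data-parallel formulation; the function names are assumptions, not the actual EVE or DPF code. Each simulated process counts the full candidate set against only its own input partition, and a reduction then sums the local count tables. The replicated candidate set and per-process count table are exactly what must fit in local memory.

```python
from collections import Counter

def is_subsequence(candidate, sequence):
    """True if candidate occurs as a (not necessarily contiguous)
    subsequence of sequence."""
    it = iter(sequence)
    return all(item in it for item in candidate)

def data_parallel_counts(partitions, candidates):
    """Simulate a data-parallel formulation: the input is partitioned,
    but every 'process' holds the FULL candidate set and its own count
    table, which is the memory bottleneck discussed above."""
    local_tables = []
    for part in partitions:            # one iteration per process
        counts = Counter()
        for seq in part:
            for cand in candidates:    # full candidate set in memory
                if is_subsequence(cand, seq):
                    counts[cand] += 1
        local_tables.append(counts)
    total = Counter()
    for table in local_tables:         # global reduction step
        total.update(table)
    return total
```

Partitioning the input alone shrinks only the sequence data per process; the candidate structures still grow with the global pattern space.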
These straightforward data-parallel formulations assign work to processes based
on the input data they have been assigned. This strategy does not scale well, as the
global candidate sequences and their counts must also fit in the local memory of each
process. Most authors therefore proposed alternatives that partition both the input
and the intermediary data. Shintani and Kitsuregawa [53] used a hash function in
Hash Partitioned Sequential Pattern Mining (HPSPM) to assign input and candidate
sequences to specific processes.
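Hash-based candidate partitioning can be sketched as follows; this is a hypothetical illustration in the spirit of HPSPM, and the hash function and helper names are assumptions rather than the published algorithm. Each candidate is deterministically assigned to one owning process, so no process needs to store the full candidate set.

```python
def owner(candidate, num_procs):
    """Map a candidate sequence to the process responsible for counting it.
    Any uniform hash works; this simple polynomial hash is illustrative."""
    h = 0
    for item in candidate:
        h = (h * 31 + hash(item)) % (2 ** 32)
    return h % num_procs

def partition_candidates(candidates, num_procs):
    """Shard the candidate set so each process stores and counts only its
    own shard, instead of replicating the whole set at every process."""
    shards = [[] for _ in range(num_procs)]
    for cand in candidates:
        shards[owner(cand, num_procs)].append(cand)
    return shards
```

The same hash applied on the input side routes each input sequence (or its relevant projections) to the processes owning the candidates it could support.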
Joshi et al. [27] proposed to shard both input and candidate sequences in Event and
Candidate Distribution (EVECAN) and rotate the smaller of the two sets among