frequent episodes and Srikant and Agrawal's generalized sequential patterns. For an
in-depth survey of serial frequent pattern mining methods, the reader may consult
the work of Han et al. [22].
Motif discovery (finding frequent sub-sequence patterns) is an important problem
in the biological domain, especially as a sub-step in sequence alignment. Rigoutsos
and Floratos [49] proposed a pattern-growth-based method for motif discovery with
rigid gaps (sets of don't care items). Wang et al. [58] use a two-step process for motif
discovery: they first search for short patterns with no gaps, called segments, and
then search for longer patterns made up of segments joined by variable-length gaps.
Liao and Chen [31] use a vertical database format they call the three-dimensional
list to speed up verification of patterns generated via direct spelling, i.e., extending
patterns at each level by each character of the typically small genetic alphabet.
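The direct spelling idea can be sketched as follows. This is a minimal, illustrative sketch only: the function name and the position-list representation are assumptions for exposition, not Liao and Chen's three-dimensional list. Each frequent pattern is extended by every symbol of the small genetic alphabet, and a vertical list of occurrence positions makes the support check for each extension cheap.

```python
# Hypothetical sketch of direct spelling over a small genetic alphabet.
# Patterns grow one symbol per level; each pattern carries a vertical
# list of its start positions in the sequence for fast support checks.

ALPHABET = "ACGT"

def direct_spelling(sequence, min_support, max_len):
    """Return {pattern: start positions} for patterns of length <= max_len
    occurring at least min_support times in sequence."""
    # Level 1: vertical lists of start positions for each single symbol.
    occurrences = {c: [i for i, s in enumerate(sequence) if s == c]
                   for c in ALPHABET}
    frequent = {p: pos for p, pos in occurrences.items()
                if len(pos) >= min_support}
    result = dict(frequent)
    # Extend each surviving pattern by every alphabet symbol.
    for _ in range(max_len - 1):
        next_level = {}
        for pattern, positions in frequent.items():
            for c in ALPHABET:
                ext = [i for i in positions
                       if i + len(pattern) < len(sequence)
                       and sequence[i + len(pattern)] == c]
                if len(ext) >= min_support:
                    next_level[pattern + c] = ext
        frequent = next_level
        result.update(frequent)
    return result
```

Because infrequent patterns are pruned at each level, the small alphabet keeps the number of candidate extensions per pattern bounded by four.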
5.2
Parallel Frequent Sequence Mining
In this section, we discuss a number of proposed parallel sequence mining algorithms,
along the lines of their memory scalability, task partitioning, and load balancing
choices.
5.2.1
Memory Scalability
Some algorithms aimed at distributed systems assume that either the set of input
sequences, the set of candidate sequences, or global count data structures fit in the
local memory of each process. This is an unrealistic expectation when mining Big
Data. Although their focus is on balancing mining tasks, Cong et al. [9, 10] (Par-FP,
Par-ASP, Par-CSP) rely on a sampling technique that requires the entire input set
to be available at each process. Shintani and Kitsuregawa [53] partitioned the input
set in Non Partitioned Sequential Pattern Mining (NPSPM), yet they assumed that
the entire candidate set can be replicated and will fit in the overall memory (random
access memory and hard drive) of a process. Similar assumptions were made by
Joshi et al. [27] in Event Distribution (EVE) and by Guralnik and Karypis [18] in
their Data Parallel Formulation (DPF).
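The memory bottleneck can be seen in a minimal, purely illustrative sketch of such a data-parallel formulation; the function names are assumptions, not the actual EVE or DPF code. Each simulated process counts the full candidate set against only its own input partition, and a reduction then sums the local count tables. The replicated candidate set and per-process count table are exactly what must fit in local memory.

```python
from collections import Counter

def is_subsequence(candidate, sequence):
    """True if candidate occurs as a (not necessarily contiguous)
    subsequence of sequence."""
    it = iter(sequence)
    return all(item in it for item in candidate)

def data_parallel_counts(partitions, candidates):
    """Simulate a data-parallel formulation: the input is partitioned,
    but every 'process' holds the FULL candidate set and its own count
    table, which is the memory bottleneck discussed above."""
    local_tables = []
    for part in partitions:            # one iteration per process
        counts = Counter()
        for seq in part:
            for cand in candidates:    # full candidate set in memory
                if is_subsequence(cand, seq):
                    counts[cand] += 1
        local_tables.append(counts)
    total = Counter()
    for table in local_tables:         # global reduction step
        total.update(table)
    return total
```

Partitioning the input alone shrinks only the sequence data per process; the candidate structures still grow with the global pattern space.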
These straightforward data-parallel formulations assign work to processes based
on the input data they have been assigned. This strategy does not scale well, as the
global candidate sequences and their counts must also fit in the local memory of each
process. Most authors therefore proposed alternatives that partition both the input
and the intermediary data. Shintani and Kitsuregawa [53] used a hash function in
Hash Partitioned Sequential Pattern Mining (HPSPM) to assign input and candidate
sequences to specific processes.
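Hash-based candidate partitioning can be sketched as follows; this is a hypothetical illustration in the spirit of HPSPM, and the hash function and helper names are assumptions rather than the published algorithm. Each candidate is deterministically assigned to one owning process, so no process needs to store the full candidate set.

```python
def owner(candidate, num_procs):
    """Map a candidate sequence to the process responsible for counting it.
    Any uniform hash works; this simple polynomial hash is illustrative."""
    h = 0
    for item in candidate:
        h = (h * 31 + hash(item)) % (2 ** 32)
    return h % num_procs

def partition_candidates(candidates, num_procs):
    """Shard the candidate set so each process stores and counts only its
    own shard, instead of replicating the whole set at every process."""
    shards = [[] for _ in range(num_procs)]
    for cand in candidates:
        shards[owner(cand, num_procs)].append(cand)
    return shards
```

The same hash applied on the input side routes each input sequence (or its relevant projections) to the processes owning the candidates it could support.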
Joshi et al. [27] proposed to shard both input and candidate sequences in Event and
Candidate Distribution (EVECAN) and rotate the smaller of the two sets among