Database Reference
In-Depth Information
iteratively merge them into larger ones until no new patterns can be found. From the
perspective of the pattern lattice model, pattern merge approach would traverse the
pattern lattice in a leaping fashion, jumping from a smaller size pattern to a much
larger one directly, and attempt to reach larger patterns as quickly as possible. As they
do not attempt to examine the entire pattern candidate space, algorithms of pattern
merge approach usually do not aim to return the complete set of long patterns over
a certain size constraint. Rather, they either set out to find the top- k largest patterns
and guarantee their success by probabilistic arguments, or strive to capture as many
long pattern as possible in an best-effort style. Depending on the aggressiveness of
the merging, these algorithms can be further put into two types—Piece-wise Merge
and Fusion-style Merge.
6.1
Piece-wise Pattern Merge
Piece-wise Pattern Merge algorithms would try to merge two smaller patterns found
so far to generate a larger new one. We show two examples here for sequence mining
and graph mining respectively.
Long Approximate Sequential Pattern In [ 32 ], a piece-wise pattern merge style
algorithm is proposed to find the complete set of closed frequent approximate se-
quential patterns defined with Hamming distance. Hamming distance, defined for
two strings of equal length, is the number of substitutions required to change one
into the other. Frequent approximate substrings are defined as follows.
Definition 4.7 (Frequent Approximate Substring (FAS)) Given a string S, a
minimum frequency threshold θ and an error tolerance threshold δ, a substring P of
S is a frequent approximate substring if and only if there exists a set U of substrings
of S and for each W
U , H ammingDist ( P , W )
≤|
P
|
δ , and
|
U
|≥
θ . U is called
the support set of P, denoted as P sup .
U is represented as a set of indices of S as all substrings in U share the same length
as P . Given a input string S , we are interested in finding all frequent approximate
substrings of S , i.e., for each such substring, the set of substrings that are considered
approximately the same must be sufficiently large. To reduce redundancy in the
result, a notion of closed frequent approximate substring is also proposed [ 32 ].
The design of the algorithm relies on a notion of a strand , which is a set of
substrings that share one same matching pattern. Consider four substrings S 1 , S 2 , S 3
and S 4 as shown in Fig. 4.5 .
All four substrings are of length 20. If the error tolerance threshold δ
=
0 . 1
and minimum frequency threshold θ
4, then S 2 is a FAS since the other
three substrings are within Hamming distance 2 from S 2 . For each substring, the
bounding boxes indicate the parts that match exactly with S 2 . In this case, S 1
and S 2 have the same matching patterns. S 2 , S 3 and S 4 have the same matching
patterns. In general, aligning any two substrings W 1 and W 2 , one can observe
an alternating sequence of maximal matching substrings and gaps of mismatches
=
Search WWH ::




Custom Search