Mining Long Patterns - Frequent Pattern Mining

Database Reference

In-Depth Information

iteratively merge them into larger ones until no new patterns can be found. From the

perspective of the pattern lattice model, pattern merge approach would traverse the

pattern lattice in a leaping fashion, jumping from a smaller size pattern to a much

larger one directly, and attempt to reach larger patterns as quickly as possible. As they

do not attempt to examine the entire pattern candidate space, algorithms of pattern

merge approach usually do not aim to return the complete set of long patterns over

a certain size constraint. Rather, they either set out to find the top- k largest patterns

and guarantee their success by probabilistic arguments, or strive to capture as many

long pattern as possible in an best-effort style. Depending on the aggressiveness of

the merging, these algorithms can be further put into two types—Piece-wise Merge

and Fusion-style Merge.

6.1

Piece-wise Pattern Merge

Piece-wise Pattern Merge algorithms would try to merge two smaller patterns found

so far to generate a larger new one. We show two examples here for sequence mining

and graph mining respectively.

Long Approximate Sequential Pattern In [ 32 ], a piece-wise pattern merge style

algorithm is proposed to find the complete set of closed frequent approximate se-

quential patterns defined with Hamming distance. Hamming distance, defined for

two strings of equal length, is the number of substitutions required to change one

into the other. Frequent approximate substrings are defined as follows.

Definition 4.7 (Frequent Approximate Substring (FAS)) Given a string S, a

minimum frequency threshold θ and an error tolerance threshold δ, a substring P of

S is a frequent approximate substring if and only if there exists a set U of substrings

of S and for each W

∈

U , H ammingDist ( P , W )

≤|

P

|

δ , and

|

U

|≥

θ . U is called

the support set of P, denoted as P sup .

U is represented as a set of indices of S as all substrings in U share the same length

as P . Given a input string S , we are interested in finding all frequent approximate

substrings of S , i.e., for each such substring, the set of substrings that are considered

approximately the same must be sufficiently large. To reduce redundancy in the

result, a notion of closed frequent approximate substring is also proposed [ 32 ].

The design of the algorithm relies on a notion of a strand , which is a set of

substrings that share one same matching pattern. Consider four substrings S 1 , S 2 , S 3

and S 4 as shown in Fig. 4.5 .

All four substrings are of length 20. If the error tolerance threshold δ

=

0 . 1

and minimum frequency threshold θ

4, then S 2 is a FAS since the other

three substrings are within Hamming distance 2 from S 2 . For each substring, the

bounding boxes indicate the parts that match exactly with S 2 . In this case, S 1

and S 2 have the same matching patterns. S 2 , S 3 and S 4 have the same matching

patterns. In general, aligning any two substrings W 1 and W 2 , one can observe

an alternating sequence of maximal matching substrings and gaps of mismatches

=

Frequent Pattern Mining

Search WWH ::

Custom Search

Home