Sequential Pattern Mining - Frequent Pattern Mining

Database Reference

In-Depth Information

set is called L d 1 . These frequent 1-sequences can be used as seeds to discover new

frequent sequences of the following types: (1) previous frequent sequences in DB

which can be extended by items of L db

1

; (2) frequent subsequences in DB which are

predecessor items in L db

1 ; (3) candidate sequences generated from L d 1 . This process

is continued iteratively (since after the first step frequent 2-sequences are obtained

to be used in the same manner as described before) until no more candidates are

generated.

In [ 34 ], an efficient algorithm, called IncSpan, is developed, for incremental

mining over multiple database increments. Several novel ideas are introduced in

the algorithm development: (1) maintaining a set of “almost frequent" sequences

as the candidates in the updated database, which has several nice properties and

leads to efficient techniques, and (2) two optimization techniques, reverse pattern

matching and shared projection , are designed to improve the performance. Reverse

pattern matching is used for matching a sequential pattern in a sequence and prune

some search space. Shared projection is designed to reduce the number of database

projections for some sequences which share a common prefix [ 34 ].

Gao et al. [ 35 ] proposed an efficient incremental algorithm called StreamCloSeq

for mining closed sequential patterns over stream data, where new data arrives con-

tinuously and the data distribution often evolves over time. StreamCloSeq stores

only frequent closed sequence prefixes in the enumeration tree, used for mining and

maintaining patterns in the current sliding window, to solve the frequent closed se-

quential pattern mining problem efficiently over stream data. Some novel effective

search space pruning and pattern closure checking strategies have been also devised

to accelerate the algorithm. For the details of this algorithm, you could refer to [ 35 ].

5.4

Hybrid Methods

DISC-all algorithm [ 36 ] combines the candidate sequence pruning, database parti-

tioning and projection with a strategy called DIrect Sequence Comparison (abbre-

viated as DISC ), which can find frequent sequences of a specific length k without

having to compute the support counts of non-frequent sequences. The DISC strategy

defines the order of two sequences having the same length using lexicographical

ordering and temporal ordering. For example,

because in the second transactions, b is smaller than c owing to the lexicographical

ordering. Furthermore,

( a )( b )( h )

is smaller than

( a )( c )( f )

( a , b )( c )

is smaller than

( a )( b , c )

because of the temporal

ordering of itemsets in the first transactions.

According to this defined sequence ordering, if t k is the smallest k -length subse-

quence of a customer sequence, t k is called k-minimum subsequence . Then, DISC-all

algorithm sorts customer sequences by the order of their associated k-minimum

subsequences, which compose a k-sorted database . In a k-sorted database, the k-

minimum subsequence at the first position is denoted by α 1 and the k-minimum

subsequence at the δ -th position is denoted by α δ , where δ is the minimum support

count. These two k-minimum subsequences are compared and if they are equal then

Frequent Pattern Mining

Search WWH ::

Custom Search

Home