Database Reference
In-Depth Information
set is called L d 1 . These frequent 1-sequences can be used as seeds to discover new
frequent sequences of the following types: (1) previous frequent sequences in DB
which can be extended by items of L db
1
; (2) frequent subsequences in DB which are
predecessor items in L db
1 ; (3) candidate sequences generated from L d 1 . This process
is continued iteratively (since after the first step frequent 2-sequences are obtained
to be used in the same manner as described before) until no more candidates are
generated.
In [ 34 ], an efficient algorithm, called IncSpan, is developed, for incremental
mining over multiple database increments. Several novel ideas are introduced in
the algorithm development: (1) maintaining a set of “almost frequent" sequences
as the candidates in the updated database, which has several nice properties and
leads to efficient techniques, and (2) two optimization techniques, reverse pattern
matching and shared projection , are designed to improve the performance. Reverse
pattern matching is used for matching a sequential pattern in a sequence and prune
some search space. Shared projection is designed to reduce the number of database
projections for some sequences which share a common prefix [ 34 ].
Gao et al. [ 35 ] proposed an efficient incremental algorithm called StreamCloSeq
for mining closed sequential patterns over stream data, where new data arrives con-
tinuously and the data distribution often evolves over time. StreamCloSeq stores
only frequent closed sequence prefixes in the enumeration tree, used for mining and
maintaining patterns in the current sliding window, to solve the frequent closed se-
quential pattern mining problem efficiently over stream data. Some novel effective
search space pruning and pattern closure checking strategies have been also devised
to accelerate the algorithm. For the details of this algorithm, you could refer to [ 35 ].
5.4
Hybrid Methods
DISC-all algorithm [ 36 ] combines the candidate sequence pruning, database parti-
tioning and projection with a strategy called DIrect Sequence Comparison (abbre-
viated as DISC ), which can find frequent sequences of a specific length k without
having to compute the support counts of non-frequent sequences. The DISC strategy
defines the order of two sequences having the same length using lexicographical
ordering and temporal ordering. For example,
because in the second transactions, b is smaller than c owing to the lexicographical
ordering. Furthermore,
( a )( b )( h )
is smaller than
( a )( c )( f )
( a , b )( c )
is smaller than
( a )( b , c )
because of the temporal
ordering of itemsets in the first transactions.
According to this defined sequence ordering, if t k is the smallest k -length subse-
quence of a customer sequence, t k is called k-minimum subsequence . Then, DISC-all
algorithm sorts customer sequences by the order of their associated k-minimum
subsequences, which compose a k-sorted database . In a k-sorted database, the k-
minimum subsequence at the first position is denoted by α 1 and the k-minimum
subsequence at the δ -th position is denoted by α δ , where δ is the minimum support
count. These two k-minimum subsequences are compared and if they are equal then
Search WWH ::




Custom Search