Information Technology Reference
In-Depth Information
introduces the use case. Our contributions are then presented in two parts; the first is
related to sequential pattern mining, then the second proposes an approach for multi-
dimensional sequences mining. Each part details the implementation and the
experimental results. The conclusion outlines the main contribution of the paper and
discusses its perspectives.
8.2 Background
8.2.1 State of the Art
Most works related to mining frequent sequences are in the field of customer transac-
tion analysis. Early work on frequent patterns - Apriori algorithm- only considered
transactions, not sequence of transactions [1]. This algorithm is costly because it car-
ries out multiple scans of the database to determine frequent subsets of items. Three
algorithms dealing with sequence of transactions are presented and compared in [2],
and [20]: AprioriAll , AprioriSome and DynamicSome . AprioriAll algorithm is an
adaptation of Apriori to sequences where candidate generation and support are com-
puted differently. AprioriAll , and AprioriSome only compute maximal frequent
sequences. Their principle is to jump to candidates of size k+next(k) in the next scan,
where next(k)>1. Maximum frequent sequences of lower size that have not been cal-
culated are given in the backward phase. The value of next(k) increases with Pk =
|Lk|/|Ck|, where Lk stands for frequent sequences of size k, and Ck the whole gener-
ated candidates of size k. DynamicSome algorithm is based on AprioriSome but uses a
jump by a multiple of user defined step. SPAM algorithm [10] uses a bitmap represen-
tation of transaction sequences once the entire database has been loaded in a lexico-
graphic tree. But this algorithm considers that the entire database and all used data
structures should completely fit into main memory, and then do not adapt for large
datasets. GSP algorithm [20] utilizes the property that all contiguous subsequences of
a frequent sequence also have to be frequent. As Apriori , it generates frequent se-
quences, then candidate sequences by adding one or more items. PrefixSPAN [14]
first finds the frequent items after scanning the database once. The sequence database
is then projected, according to the frequent items, into several smaller databases. Fi-
nally, all sequential patterns are found by recursively growing subsequence fragments
in each projected database. Employing a divide-and-conquer strategy with the Pat-
ternGrowth methodology, PrefixSPAN efficiently mines the complete set of patterns.
As for multidimensional sequences mining , it has been studied recently and
partially. Main works deal with frequent patterns that mix sequence items and dimen-
sions. The main contribution is from [16]. They have defined three algorithms for
mining multidimensional sequential patterns : UniSeq uses PrefixSPAN [14] to mine
multidimensional sequential patterns with sequences extended with dimensional in-
formation. Dim-Seq uses the BUC-like algorithm [4] to first mine multidimensional
patterns , then PrefixSPAN is used to mine the sequential patterns associated to the
multidimensional patterns . Seq-Dim first mines multidimensional patterns using
BUC-like algorithm, then uses PrefixSPAN to mine the associated sequential patterns .
Notice that the three algorithms produce the same result. However, they do not make
Search WWH ::




Custom Search