From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

introduces the use case. Our contributions are then presented in two parts; the first is

related to sequential pattern mining, then the second proposes an approach for multi-

dimensional sequences mining. Each part details the implementation and the

experimental results. The conclusion outlines the main contribution of the paper and

discusses its perspectives.

8.2 Background

8.2.1 State of the Art

Most works related to mining frequent sequences are in the field of customer transac-

tion analysis. Early work on frequent patterns - Apriori algorithm- only considered

transactions, not sequence of transactions [1]. This algorithm is costly because it car-

ries out multiple scans of the database to determine frequent subsets of items. Three

algorithms dealing with sequence of transactions are presented and compared in [2],

and [20]: AprioriAll , AprioriSome and DynamicSome . AprioriAll algorithm is an

adaptation of Apriori to sequences where candidate generation and support are com-

puted differently. AprioriAll , and AprioriSome only compute maximal frequent

sequences. Their principle is to jump to candidates of size k+next(k) in the next scan,

where next(k)>1. Maximum frequent sequences of lower size that have not been cal-

culated are given in the backward phase. The value of next(k) increases with Pk =

|Lk|/|Ck|, where Lk stands for frequent sequences of size k, and Ck the whole gener-

ated candidates of size k. DynamicSome algorithm is based on AprioriSome but uses a

jump by a multiple of user defined step. SPAM algorithm [10] uses a bitmap represen-

tation of transaction sequences once the entire database has been loaded in a lexico-

graphic tree. But this algorithm considers that the entire database and all used data

structures should completely fit into main memory, and then do not adapt for large

datasets. GSP algorithm [20] utilizes the property that all contiguous subsequences of

a frequent sequence also have to be frequent. As Apriori , it generates frequent se-

quences, then candidate sequences by adding one or more items. PrefixSPAN [14]

first finds the frequent items after scanning the database once. The sequence database

is then projected, according to the frequent items, into several smaller databases. Fi-

nally, all sequential patterns are found by recursively growing subsequence fragments

in each projected database. Employing a divide-and-conquer strategy with the Pat-

ternGrowth methodology, PrefixSPAN efficiently mines the complete set of patterns.

As for multidimensional sequences mining , it has been studied recently and

partially. Main works deal with frequent patterns that mix sequence items and dimen-

sions. The main contribution is from [16]. They have defined three algorithms for

mining multidimensional sequential patterns : UniSeq uses PrefixSPAN [14] to mine

multidimensional sequential patterns with sequences extended with dimensional in-

formation. Dim-Seq uses the BUC-like algorithm [4] to first mine multidimensional

patterns , then PrefixSPAN is used to mine the sequential patterns associated to the

multidimensional patterns . Seq-Dim first mines multidimensional patterns using

BUC-like algorithm, then uses PrefixSPAN to mine the associated sequential patterns .

Notice that the three algorithms produce the same result. However, they do not make

Search WWH ::

Custom Search

Home