From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

Activity programs of most individuals may be the same or be similar. Each activity

program could be seen as a sequence of single value, making it possible to discover

frequent activity sequences that characterise groups of the surveyed individuals. This

allows analyzing the mobility of this urban population. Likewise, when considering

transport mode, schedules or duration sequences, it would be possible to determine a

typology of used transport modes, schedules, and so on. Besides, the survey holds

other information about individuals as their age, gender, profession, and so on. Then,

each group having similar activity pattern is likely to have some characteristic

attribute values. Hence, it is very relevant to characterize those groups and their corre-

sponding sequential patterns. Those data have been used in the second part yielding

characteristic rules for the groups of the surveyed individuals that share approxi-

mately those activity patterns. Moreover, other datasets have been used mainly to test

the scalability and to compare our algorithms in the most widely used contexts, such

as public datasets 1 .

8.3 A Sequential Pattern Mining Algorithm

This section proposes an algorithm of sequential pattern mining. We focus on the

specific case where the considered sequences composed of single items instead of

item-sets. We argue that this is the case if the most popular sequences, such as DNA

[9], documents, web-logs, or activity program sequences. Our algorithm is compared

to PrefixSPAN , and SPAN, ones of the most efficient mining algorithms.

8.3.1 Algorithm Overview

The proposed algorithm is two phases. The first stage is the data encoding into a

memory resident data structures. The second one is the frequent generation that in

turn is composed of candidate generation, and candidate support checking. This algo-

rithm only makes one scan of the database during which the total number of distinct

sequences, the frequency of these sequences and the number of sequences by size are

computed. This allows computing the support of each generated sequence.

The backbone of our approach is its main memory data structure, called IBM as

Indexed Bit Map. It is composed of four elements: (i) a Bitmap: a matrix that repre-

sents the distinct sequences of the database, (ii) SV : a sequence vector that encodes all

the ordered combinations of sequences, (iii) INDEX: an index on the Bitmap that

allows a direct access to sequences according to their size, (iv) NB: a table associated

to the Bitmap which informs about the frequency of each distinct sequence (Fig. 8.1).

Only distinct sequences are stored in the Bitmap, they are classified by decreasing

size. An index by size allows a direct access to sequences according to their size,

which optimizes the candidate generation and counting phase of the algorithm. In the

example of Fig. 8.1, IBM encodes the whole distinct sequences of the database.

Notice that Index and the Bitmap are numbered down-up. Here, there are 6 distinct

sequences of size 1 to 5: H, W, HSR, HSH, HSMH and HSRWH. Each cell of the

INDEX indicates the first line where the corresponding size of sequence is stored. For

1 http://kdd.ics.uci.edu

Search WWH ::

Custom Search

Home