From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

This leads us to propose, in the first part of this chapter, a new algorithm that

aims to improve the performance of mining sequential association rules while

reducing resource consumption. We make the following contributions:

1. The algorithm makes only one scan of the database;

2. It is based on a highly compact main-memory data structure, which reduces the

storage required;

3. It provides fast access to the data through an index structure;

4. The experimental results show that our algorithm outperforms existing ones.
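
To make the first three points concrete, here is a minimal sketch of a one-pass index over a sequence database. It is not the data structure proposed in this chapter, which is not described in this introduction. It is a plain inverted index, and it assumes that every sequence is a list of single items. The names build_index and support_count are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def build_index(database):
    """Build an item -> {sequence id -> sorted positions} index in a single scan."""
    index = defaultdict(lambda: defaultdict(list))
    for sid, sequence in enumerate(database):  # the only pass over the data
        for pos, item in enumerate(sequence):
            index[item][sid].append(pos)       # positions arrive in increasing order
    return index


def support_count(index, pattern):
    """Count the sequences that contain `pattern` as a subsequence, using the index only."""
    if not pattern:
        return 0
    count = 0
    # Only sequences holding the first item of the pattern can match.
    for sid in index.get(pattern[0], {}):
        pos = -1
        for item in pattern:
            positions = index.get(item, {}).get(sid, [])
            i = bisect_right(positions, pos)   # earliest occurrence after `pos`
            if i == len(positions):
                break                          # the item does not occur late enough
            pos = positions[i]
        else:
            count += 1                         # every item was matched in order
    return count


# Illustrative data (assumed, not taken from the chapter).
database = [["a", "b", "c"], ["a", "c"], ["b", "a", "c"], ["c", "b"]]
index = build_index(database)
print(support_count(index, ["a", "c"]))  # 3
print(support_count(index, ["b", "c"]))  # 2
```

Once the index is built, candidate patterns are counted without rereading the database, which is the kind of benefit that a single scan combined with an index is meant to deliver.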

Mining sequential patterns has many interesting applications in its own right. Beyond

performance issues, many works have proposed new features, such as incremental

sequential pattern mining [5] [12], restriction by constraints [14], or handling new

types of data, such as query plans [26]. Among these extensions, multidimensional

sequence mining is a major issue [16]. It allows the discovery of rules that

link sequences (e.g. transaction history) to regular attribute data (such as

those in a client file). Such rules may describe customer profiles, e.g. to which category

of individuals a given purchase (or a given path traversal pattern) corresponds.

This is the subject of the second part of this chapter.

Our approach consists in mining individual profiles, based on attributes, for the

most frequent sequential patterns. To this end, we propose a characterization-based

approach in which a whole sequence is treated as a complex attribute. It thus makes

sense to integrate reasoning on sequences (frequent patterns, similarity, grouping),

while the other dimensions are treated as descriptive of each sequence group. Briefly,

our approach has two steps. The first gathers all database sequences around

the most similar sequential pattern in order to derive classes of sequences represented

by their sequential patterns. The second step describes these classes (and their

sequential patterns) by the multidimensional attribute values that characterize them.
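
The two steps can be sketched as follows. This is only an illustration under stated assumptions: the similarity measure used here (the length of the longest common subsequence divided by the pattern length) is a stand-in, not the measure adopted in this chapter. The record layout (an attribute dictionary paired with a sequence), the function names, and the sample data are likewise hypothetical. Patterns are assumed to be non-empty tuples.

```python
from collections import Counter, defaultdict


def lcs_length(a, b):
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def similarity(sequence, pattern):
    """Assumed measure: share of the pattern that the sequence reproduces in order."""
    return lcs_length(sequence, pattern) / len(pattern)


def group_by_pattern(records, patterns):
    """Step 1: attach each record to the pattern it is most similar to."""
    classes = defaultdict(list)
    for attributes, sequence in records:
        best = max(patterns, key=lambda p: similarity(sequence, p))
        classes[best].append(attributes)
    return classes


def describe(classes):
    """Step 2: count the attribute values observed inside each class."""
    return {
        pattern: Counter(
            (name, value) for attrs in members for name, value in attrs.items()
        )
        for pattern, members in classes.items()
    }


# Illustrative records: (attributes, activity sequence).
records = [
    ({"age": "adult", "job": "employed"}, ("home", "work", "home")),
    ({"age": "adult", "job": "employed"}, ("home", "work", "shop", "home")),
    ({"age": "child", "job": "student"}, ("home", "school", "home")),
]
patterns = [("home", "work", "home"), ("home", "school", "home")]

profiles = describe(group_by_pattern(records, patterns))
for pattern, counts in profiles.items():
    print(pattern, counts.most_common(2))
```

Each class is keyed by its representative pattern, so the attribute counts can be read directly as a description of the individuals who follow that pattern.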

The characteristic rules express which attribute properties are typical of frequent

sequential patterns. The sequential patterns must meet a given support threshold,

and each rule must hold with at least a given confidence threshold. The extraction of

such rules raises three main questions:

1. How to determine that a sequence or a subsequence is similar to another?

2. How to group multidimensional sequences with a given sequential pattern?

3. How to determine the most characteristic properties for a group of sequences?

We have adopted different solutions, which we detail later in the chapter.
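
As an illustration of how the two thresholds interact, the sketch below computes the support of a pattern and the confidence of a rule of the form "pattern implies attribute = value". It makes simplifying assumptions that may differ from the solutions adopted in this chapter: a sequence is counted for a pattern when it contains the pattern as a subsequence, support is a fraction of the (non-empty) database, and confidence is the fraction of supporting sequences whose attribute has the given value. All names and data are hypothetical.

```python
from collections import Counter


def is_subsequence(pattern, sequence):
    """True if the items of `pattern` occur in `sequence` in the same order."""
    it = iter(sequence)
    return all(item in it for item in pattern)  # `in` consumes the iterator


def characteristic_rules(patterns, records, min_support, min_confidence):
    """Return (pattern, attribute, value, support, confidence) for rules above both thresholds."""
    rules = []
    for pattern in patterns:
        matching = [attrs for attrs, seq in records if is_subsequence(pattern, seq)]
        support = len(matching) / len(records)
        if support < min_support:
            continue  # infrequent pattern: no rule is derived from it
        counts = Counter(
            (name, value) for attrs in matching for name, value in attrs.items()
        )
        for (name, value), n in counts.items():
            confidence = n / len(matching)
            if confidence >= min_confidence:
                rules.append((pattern, name, value, support, confidence))
    return rules


# Illustrative records: (attributes, activity sequence).
records = [
    ({"job": "employed"}, ("home", "work", "home")),
    ({"job": "employed"}, ("home", "work", "shop", "home")),
    ({"job": "student"}, ("home", "school", "home")),
    ({"job": "retired"}, ("home", "work", "home")),
]
patterns = [("home", "work", "home"), ("home", "school", "home")]

for rule in characteristic_rules(patterns, records, min_support=0.5, min_confidence=0.6):
    print(rule)
```

With these sample values, only the pattern ("home", "work", "home") reaches the support threshold (3 of 4 sequences), and only job = "employed" reaches the confidence threshold within it (2 of 3 supporting sequences).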

Both methods have been evaluated on a real dataset from a survey of the population's

daily activity and mobility. The goal is to mine frequent patterns of activity

sequences, then to analyze the profile of the population having those typical activity

sequences. In addition, other experiments have been conducted to test the scalability

of the sequential pattern mining algorithm, using synthetic data and widely used,

publicly available datasets.

This chapter combines and extends two previously published papers, namely [17] and

[18]. It is organized as follows: a background section provides an overview of the

state of the art, then states the concepts and definitions used later, and finally, it
