From Sequence Mining to Multidimensional Sequence Mining - Mining Complex Data

Information Technology Reference

In-Depth Information

8

From Sequence Mining to Multidimensional

Sequence Mining

Karine Zeitouni

PRISM Laboratory, University of Versailles-St-Quentin

45 Avenue des Etats-Unis, 78035 Versailles, France

Karine.Zeitouni@prism.uvsq.fr

Abstract. Sequential pattern mining has been broadly studied and many algorithms have been

proposed. The first part of this chapter proposes a new algorithm for mining frequent se-

quences. This algorithm processes only one scan of the database thanks to an indexed structure

associated to a bit map representation. Thus, it allows a fast data access and a compact storage

in main memory. Experiments have been conducted using real and synthetic datasets. The

experimental results show the efficiency of our method compared to existing algorithms.

Beyond mining plain sequences, taking into account multidimensional information associated

to sequential data is for a great interest for many applications. In the second part, we propose a

characterization based multidimensional sequential patterns mining. This method first groups

sequences by similarity; then characterizes each cluster using multidimensional properties

describing the sequences. The clusters are built around the frequent sequential patterns. Thus,

the whole process results in rules characterizing sequential patterns using multidimensional

information. This method has been experimented towards a survey on population daily activity

and mobility in order to analyze the profile of the population having typical activity sequences.

The extracted rules show our method effectiveness.

Keywords: Sequential data mining, data structures, algorithms, optimization.

8.1 Introduction

The problem of mining sequential patterns was first introduced in the context of mar-

ket basket analysis [2]. It aims to retrieve frequent patterns in the sequences of prod-

ucts purchased by customers through time ordered transactions. Several algorithms

have been proposed in order to improve the performances and to reduce required

space in memory [20] [26] [11] [13] [10]. Other works have concerned mining

frequent sequences in DNA [9] or Web Usage Mining [19] [22]. In general, it could

apply to any database containing a collection of item sequences. Our target applica-

tion was based on a time-use survey, more precisely the database reports the daily

activities and displacements carried out by each surveyed person in a household.

After having tested the most cited algorithms, we have observed their weakness to

scale with large size datasets. Indeed, most of them perform multiple scans of the

database, which is the main bottleneck in mining. Others require too large memory

space to load the data when the database size increases.

Search WWH ::

Custom Search

Home