Mining Sequential Patterns from Probabilistic Databases by Pattern-Growth - Advances in Databases

Database Reference

In-Depth Information

Mining Sequential Patterns from Probabilistic

Databases by Pattern-Growth

Muhammad Muzammal

Department of Computer Science, University of Leicester, UK

mm386@mcs.le.ac.uk

Abstract. We propose a pattern-growth approach for mining sequential

patterns from probabilistic databases . Our considered model of uncer-

tainty is about the situations where there is uncertainty in associating

an event with a source ; and consider the problem of enumerating all

sequences whose expected support satisfies a user-defined threshold θ .

In an earlier work [Muzammal and Raman, PAKDD'11], adapted rep-

resentative candidate generate-and-test approaches, GSP (breadth-first

sequence lattice traversal) and SPADE/SPAM (depth-first sequence lat-

tice traversal) to the probabilistic case. The authors also noted the dif-

ficulties in generalizing PrefixSpan to the probabilistic case (PrefixSpan

is a pattern-growth algorithm, considered to be the best performer for

deterministic sequential pattern mining). We overcome these diculties

in this note and adapt PrefixSpan to work under probabilistic settings.

We then report on an experimental evaluation of the candidate generate-

and-test approaches against the pattern-growth approach.

Keywords: Mining Uncertain Data, Mining complex sequential data,

Probabilistic Databases, Novel models and algorithms.

1

Introduction

Agrawal and Srikant [14,2] defined the problem of Sequential Pattern Mining

(SPM) , which involves discovery of frequent sequences of events in data with

a temporal component; SPM has become a classical and well-studied problem

in data mining [17,13,3]. In classical SPM, the database to be mined consists

of tuples

,where e is an event , σ is a source and eid is an event-

id which incorporates a time-stamp . A tuple may record a retail transaction

(event) by a customer (source), or an observation of an object/person (event)

by a sensor/camera (source). All of the components of the tuple are assumed to

be certain , or completely determined.

However, it is recognized that data obtained from a wide range of data sources

is inherently uncertain [1]. This paper is concerned with SPM in probabilistic

databases [15], a popular framework for modelling uncertainty. Recently sev-

eral data mining and ranking problems have been studied in this framework,

including top- k [18,5], frequent itemset mining (FIM) [1,4] and sequential pat-

tern mining [11,12]. In [11] two kinds of uncertainty in SPM were formalized:

eid, e, σ

Search WWH ::

Custom Search

Home