Database Reference
In-Depth Information
Mining Sequential Patterns from Probabilistic
Databases by Pattern-Growth
Muhammad Muzammal
Department of Computer Science, University of Leicester, UK
mm386@mcs.le.ac.uk
Abstract.
We propose a
pattern-growth
approach for mining
sequential
patterns
from
probabilistic databases
. Our considered model of uncer-
tainty is about the situations where there is uncertainty in associating
an
event
with a
source
; and consider the problem of enumerating all
sequences whose
expected support
satisfies a user-defined threshold
θ
.
In an earlier work [Muzammal and Raman, PAKDD'11], adapted rep-
resentative candidate generate-and-test approaches, GSP (breadth-first
sequence lattice traversal) and SPADE/SPAM (depth-first sequence lat-
tice traversal) to the probabilistic case. The authors also noted the dif-
ficulties in generalizing PrefixSpan to the probabilistic case (PrefixSpan
is a pattern-growth algorithm, considered to be the best performer for
deterministic sequential pattern mining). We overcome these diculties
in this note and adapt PrefixSpan to work under probabilistic settings.
We then report on an experimental evaluation of the candidate generate-
and-test approaches against the pattern-growth approach.
Keywords:
Mining Uncertain Data, Mining complex sequential data,
Probabilistic Databases, Novel models and algorithms.
1
Introduction
Agrawal and Srikant [14,2] defined the problem of
Sequential Pattern Mining
(SPM)
, which involves discovery of frequent sequences of events in data with
a temporal component; SPM has become a classical and well-studied problem
in data mining [17,13,3]. In classical SPM, the database to be mined consists
of tuples
,where
e
is an
event
,
σ
is a
source
and
eid
is an
event-
id
which incorporates a
time-stamp
. A tuple may record a retail transaction
(event) by a customer (source), or an observation of an object/person (event)
by a sensor/camera (source). All of the components of the tuple are assumed to
be
certain
, or completely determined.
However, it is recognized that data obtained from a wide range of data sources
is inherently uncertain [1]. This paper is concerned with SPM in
probabilistic
databases
[15], a popular framework for modelling uncertainty. Recently sev-
eral data mining and ranking problems have been studied in this framework,
including top-
k
[18,5], frequent itemset mining (FIM) [1,4] and sequential pat-
tern mining [11,12]. In [11] two kinds of uncertainty in SPM were formalized:
eid, e, σ