Database Reference
In-Depth Information
Mining Sequential Patterns from Probabilistic
Databases by Pattern-Growth
Muhammad Muzammal
Department of Computer Science, University of Leicester, UK
mm386@mcs.le.ac.uk
Abstract. We propose a pattern-growth approach for mining sequential
patterns from probabilistic databases . Our considered model of uncer-
tainty is about the situations where there is uncertainty in associating
an event with a source ; and consider the problem of enumerating all
sequences whose expected support satisfies a user-defined threshold θ .
In an earlier work [Muzammal and Raman, PAKDD'11], adapted rep-
resentative candidate generate-and-test approaches, GSP (breadth-first
sequence lattice traversal) and SPADE/SPAM (depth-first sequence lat-
tice traversal) to the probabilistic case. The authors also noted the dif-
ficulties in generalizing PrefixSpan to the probabilistic case (PrefixSpan
is a pattern-growth algorithm, considered to be the best performer for
deterministic sequential pattern mining). We overcome these diculties
in this note and adapt PrefixSpan to work under probabilistic settings.
We then report on an experimental evaluation of the candidate generate-
and-test approaches against the pattern-growth approach.
Keywords: Mining Uncertain Data, Mining complex sequential data,
Probabilistic Databases, Novel models and algorithms.
1
Introduction
Agrawal and Srikant [14,2] defined the problem of Sequential Pattern Mining
(SPM) , which involves discovery of frequent sequences of events in data with
a temporal component; SPM has become a classical and well-studied problem
in data mining [17,13,3]. In classical SPM, the database to be mined consists
of tuples
,where e is an event , σ is a source and eid is an event-
id which incorporates a time-stamp . A tuple may record a retail transaction
(event) by a customer (source), or an observation of an object/person (event)
by a sensor/camera (source). All of the components of the tuple are assumed to
be certain , or completely determined.
However, it is recognized that data obtained from a wide range of data sources
is inherently uncertain [1]. This paper is concerned with SPM in probabilistic
databases [15], a popular framework for modelling uncertainty. Recently sev-
eral data mining and ranking problems have been studied in this framework,
including top- k [18,5], frequent itemset mining (FIM) [1,4] and sequential pat-
tern mining [11,12]. In [11] two kinds of uncertainty in SPM were formalized:
eid, e, σ
 
Search WWH ::




Custom Search