Mining Sequential Patterns from Probabilistic Databases by Pattern-Growth - Advances in Databases

Database Reference

In-Depth Information

source-level uncertainty (SLU) and event-level uncertainty (ELU) .InSLU,the

“source” attribute of each tuple is uncertain: each tuple contains a probability

distribution over possible sources ( attribute-level uncertainty [15]). As noted in

[11], this formulation applies to scenarios where it is known that some customer

made a specific retail transaction, but the identity of the customer who made

that transaction is uncertain. This could happen because of incomplete cus-

tomer details, or because the customer database itself is probabilistic as a result

of “deduplication” or cleaning [7]. In ELU, the source of the tuple is certain,

but the events are uncertain. This applies to several scenarios involving sensors

in fixed locations detecting events using noisy techniques. In this case, either

the existence of an event is uncertain (for example tracking endangered animals

using sensors [8] or aggregating unreliable observations into events [9]) or the

event's existence is essentially certain, but its content is uncertain (for example,

reading a numberplate using automated numberplate recognition or sequences

of search terms that may have ambiguous meanings [11]).

Our contributions. In [12], ecient algorithms were proposed for the SPM prob-

lem in SLU probabilistic databases, under the expected support measure. These

algorithms were based on the candidate generate-and-test approach, which is

known to be relatively inecient for classical SPM. In [12], it was noted that it

is not straightforward to adapt the pattern-growth approach [13], which is usually

the best for classical SPM, to the probabilistic case. In this paper, we overcome

the obstacles mentioned in [12], and propose a pattern-growth method for the

SPM problem in probabilistic databases. In classical SPM, pattern-growth works

by performing L 1 computation on a projected database. The key contributions of

this work are to formulate the analogue of a projected database in probabilistic

settings, and to identify the appropriate L 1 computation to perform on the pro-

jected database. Unlike the deterministic case, it appears that pattern-growth for

probabilistic SPM appears to require additional memory. We have implemented

the new pattern-growth algorithm and present an experimental evaluation of

the pattern-growth algorithm against the ones presented in [12]; this evaluation

shows that although pattern-growth is superior to candidate generation in the

deterministic settings, the picture is not as clear as for the probabilistic case.

2 Problem Statement

Classical SPM [14,2]. Let

{i 1 ,i 2 ,...,i q }

be a set of items and

{

1 ,...,m}

be a set of sources .An event e ⊆I

is a collection of items. A database

D =

is an ordered list of records such that each r i ∈ D is of the

form ( eid i ,e i ,σ i ), where eid i is a unique event-id, including a time-stamp (events

are ordered by this time-stamp), e i is an event and σ i is a source.

A sequence s =

r 1 ,r 2 ,...,r n

is an ordered list of events. The events s i in the

sequence are called its elements .The length of a sequence s is the total number of

s 1 ,s 2 ,...,s a

items in it, i.e. j =1 |s j |

; for any integer k ,a k -sequence is a sequence of length k .

Let s =

be two sequences. We say that s is a

subsequence of t , denoted s t , if there exist integers 1

s 1 ,s 2 ,...,s q

and t =

t 1 ,t 2 ,...,t r

≤ i 1 <i 2 < ···<i q ≤ r

Advances in Databases

Search WWH ::

Custom Search

Home