i.e., Pr^k(t) = E[X_t]. Our objective is to draw a set S of sample possible worlds, and compute the mean of X_t in S, namely E_S[X_t], as the approximation of E[X_t].
We use uniform sampling with replacement. For table T = {t_1, ..., t_n} and the set of generation rules R, a sample unit (i.e., an observation) is a possible world. We generate the sample units under the distribution of T: to pick a sample unit s, we scan T once. An independent tuple t_i is included in s with probability Pr(t_i). For a multi-tuple generation rule R: t_{r_1} ⊕ ... ⊕ t_{r_m}, s includes one tuple involved in R with probability Pr(R). If s takes a tuple in R, then tuple t_{r_l} (1 ≤ l ≤ m) is chosen with probability Pr(t_{r_l}) / Pr(R). Thus, s contains at most one tuple from any generation rule.
Once a sample unit s is generated, we compute the top- k tuples in s . For each
tuple t in the top- k list, X t =
1. The indicators for other tuples are set to 0.
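As a concrete illustration, drawing one sample unit and computing its indicators can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: tuples are identified by ids, membership probabilities are passed as dicts, a rule's probability Pr(R) is taken to be the sum of its members' probabilities, and all names are hypothetical.

```python
import random

def draw_sample_unit(independent, rules, score, k):
    """Draw one possible world (a sample unit) and compute indicators.

    independent: dict mapping tuple id -> membership probability Pr(t).
    rules: list of multi-tuple generation rules, each a dict mapping
           tuple id -> Pr(t); Pr(R) is the sum over the rule's members,
           and at most one member can appear in a possible world.
    score: dict mapping tuple id -> ranking score (higher ranks first).
    Returns (indicators, topk): X_t = 1 for the top-k tuples of the
    unit, 0 for the other tuples appearing in it (absent tuples are
    implicitly 0).
    """
    world = []
    # An independent tuple t_i enters s with probability Pr(t_i).
    for t, p in independent.items():
        if random.random() < p:
            world.append(t)
    # A rule R contributes one tuple with probability Pr(R); member
    # t_rl is then chosen with probability Pr(t_rl) / Pr(R).
    for rule in rules:
        pr_rule = sum(rule.values())
        if random.random() < pr_rule:
            u, acc = random.random() * pr_rule, 0.0
            for t, p in rule.items():
                acc += p
                if u < acc:
                    world.append(t)
                    break
    topk = sorted(world, key=score.get, reverse=True)[:k]
    return {t: 1 if t in topk else 0 for t in world}, topk
```

Repeating this draw and averaging the indicator of each tuple over all units yields E_S[X_t].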
The above sample generation process can be repeated so that a sample S is obtained. Then, E_S[X_t] can be used to approximate E[X_t]. When the sample size is large enough, the approximation quality can be guaranteed following from the well-known Chernoff-Hoeffding bound [182].
Theorem 5.8 (Sample size). For any δ ∈ (0, 1), ε > 0, and a sample S of possible worlds, if |S| ≥ 3 ln(2/δ) / ε², then for any tuple t, Pr{ |E_S[X_t] − E[X_t]| > ε · E[X_t] } ≤ δ.
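The bound of Theorem 5.8 translates directly into a small sample-size calculator (the function name is ours):

```python
import math

def sample_size(eps, delta):
    """Smallest |S| satisfying Theorem 5.8:
    |S| >= 3 ln(2/delta) / eps^2 guarantees that, for every tuple t,
    Pr{|E_S[X_t] - E[X_t]| > eps * E[X_t]} <= delta."""
    return math.ceil(3 * math.log(2 / delta) / eps ** 2)
```

For instance, a 10% relative error (ε = 0.1) with 95% confidence (δ = 0.05) requires at least 1,107 sample units.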
We can implement the sampling method efficiently using the following two tech-
niques, as verified by our experiments.
First, we can sort all tuples in T in the ranking order into a sorted list L . The first
k tuples in a sample unit are the top- k answers in the unit. Thus, when generating a
sample unit, instead of scanning the whole table T , we only need to scan L from the
beginning and generate the tuples in the sample as described before. However, once
the sample unit has k tuples, the generation of this unit can stop. In this way, we
reduce the cost of generating sample units without losing the quality of the sample.
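A minimal sketch of this early-stopping scan, under the simplifying assumption that all tuples are independent and the list L stores membership probabilities in ranking order (all names are ours):

```python
import random

def scan_until_k(sorted_probs, k):
    """Scan the ranked list L from the top, adding each independent
    tuple with its membership probability, and stop as soon as the
    sample unit holds k tuples. Returns the number of tuples scanned."""
    hits = scanned = 0
    for p in sorted_probs:
        scanned += 1
        if random.random() < p:
            hits += 1
            if hits == k:
                break
    return scanned

random.seed(1)
L = [0.5] * 10_000            # |T| = 10,000 independent tuples, mu = 0.5
trials = [scan_until_k(L, 10) for _ in range(5_000)]
avg = sum(trials) / len(trials)
# avg stays close to k / mu = 10 / 0.5 = 20, far below |T| = 10,000
```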
For example, when all tuples are independent, if the average membership probability is μ, the expected number of tuples we need to scan to generate a sample unit is k/μ, which can be much smaller than |T| when k ≪ |T|.
Second, in practice, the actual approximation quality may converge well before the sample size reaches the bound given in Theorem 5.8. Thus, progressive sampling can be adopted: we generate sample units one by one and compute the estimated top-k probability of tuples after each unit is drawn. For given parameters d > 0 and φ > 0, the sampling process stops if, in the last d sample units, the change of the estimated top-k probability of any tuple t is smaller than φ.
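Progressive sampling with the (d, φ) stopping rule might look like the sketch below. It interprets the rule as requiring every tuple's estimate to change by less than φ for d consecutive units; `draw_unit` is a hypothetical callable returning the top-k tuple ids of one freshly drawn sample unit.

```python
def progressive_estimate(draw_unit, d, phi, max_units=10_000):
    """Draw sample units one by one, maintaining the running estimate
    E_S[X_t] = (# units where t is in the top-k) / (# units) for every
    tuple seen so far. Stop once no estimate has changed by phi or
    more for d consecutive units (or at max_units)."""
    counts, prev, est = {}, {}, {}
    stable = n = 0
    while n < max_units:
        n += 1
        for t in draw_unit():          # top-k tuple ids of a fresh unit
            counts[t] = counts.get(t, 0) + 1
        est = {t: c / n for t, c in counts.items()}
        # largest change of any estimate since the previous unit
        change = max(
            (abs(est.get(t, 0.0) - prev.get(t, 0.0))
             for t in est.keys() | prev.keys()),
            default=0.0,
        )
        stable = stable + 1 if change < phi else 0
        if stable >= d:
            break
        prev = est
    return est, n
```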
To answer a PT- k query with probability threshold p , we first run the above sam-
pling algorithm. Then, we scan the tuples in L and output the tuples whose estimated
top- k probabilities are at least p .
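The final filtering step reduces to a threshold scan over the ranked list (a sketch with hypothetical names):

```python
def pt_k_answer(ranked_ids, est_topk_prob, p):
    """Scan the tuples in ranking order (the list L) and output those
    whose estimated top-k probability is at least the threshold p."""
    return [t for t in ranked_ids if est_topk_prob.get(t, 0.0) >= p]
```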
After obtaining estimated top-k probabilities of tuples using the above sampling method, top-(k, l) queries and top-(p, l) queries can be answered similarly as discussed in Section 5.2.1.