Continuous Ranking Queries on Uncertain Streams - Ranking Queries on Uncertain Data

Database Reference

In-Depth Information

6.2 A Sampling Method

In this section, we propose a sampling method to estimate the top- k probability of

each stream with a probabilistic quality guarantee.

For a stream O in a sliding window W t

, we are interested in the event that O

is ranked top- k . Let Z O be the indicator to the event: Z O =

( O )

1if O is ranked top- k in

W t

Pr k

( O )

; Z O =

(

Z O =

(

)

0 otherwise. Then, Pr

To approximate the probability Pr

(

Z O

)

, we design a statistic experiment as

follows. We draw samples of possible worlds and compute the top- k lists on the

samples. That is, in a sample, for each stream O we select an instance o in window

W t

. A sample is a possible world. Then, we sort all instances in the sample in

the ranking order and find the top- k list.

We repeat the experiment m times independently. Let Z O , i be the value of Z O at

the i -th run. Then, E

(

)

i = 1 Z O , i is an estimation of E

By using a sufficiently large number of samples, we can obtain a good approx-

imation of Pr k

[

Z O ]=

∑

[

Z O ]=

(

Z O =

)

(

)

with high probability. The methods follows the idea of unre-

stricted random sampling (also known as simple random sampling with replace-

ment) [34]. The following minimum sample size can be derived from the well

known Chernoff-Hoffding bound [182].

Z O

Theorem 6.5 (Minimum sample size). For any stream O, let Z O , 1 ,...,

Z O , m be the

values of Z O in m independent experiments, and E

[

Z O ]=

1 Z O , i . For any

∑

3ln 2

{| E

( 0

< δ <

1 ),

(

ξ >

0 ), if m

≥

, then Pr

[

Z O ] −

[

Z O ] |> ξ }≤ δ

For efficient implementation, we maintain an indicator variable for each stream.

To avoid sorting instances repeatedly, we first sort all instances in a sliding window

W t

. When drawing a sample, we scan the sorted list from the beginning, and

select an instance o

( O )

if stream O has no instance in a sample

yet. If an instance is chosen, the corresponding stream indicator is set. When the

sample already contains k instances, the scan stops since the instances sampled later

cannot be in the top- k list. The sample can then be discarded since it will not be used

later.

The space complexity of the sampling method is O

∈

O in probability

, because all instances in

the sliding window have to be stored. The time complexity is O

(

ω)

(

ω +

log

(

ω))

for the first window and O

for other windows where m is the

number of samples drawn, since the n new instances can be inserted into the sorted

list in W t

(

ω +

n log

(

ω))

to form the sorted list in W t + 1

( O )

6.3 Space Efficient Methods

In the exact algorithm and the sampling algorithm, we need to store the sliding

window in main memory. In some applications, there can be a large number of

streams and the window width is non-trivial. In this section, we develop the space

Ranking Queries on Uncertain Data

Search WWH ::

Custom Search

Home