Figure 6.4(a) shows that when parameter k increases, the runtime of the naive method and the deterministic method also increases. With a larger k, more instances are likely to be ranked top-k, and thus more instances have to be read before the pruning techniques take effect. Moreover, the Poisson binomial recurrence used by those two methods has a linear complexity with respect to k. However, the deterministic method has a clear advantage over the naive method, which shows the effectiveness of the compatible dominant set technique and of the pruning using the highest possible rank.
The runtime of the sampling methods and the Det-Q method is less sensitive to k, since those techniques incur only a very small increase in overhead as k increases.
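To make the linear dependence on k concrete, the following Python sketch evaluates one instance's top-k probability with a Poisson binomial recurrence. The function name and interface are our own: we assume dominating_probs already holds, for each other object, the probability that it outranks the instance in question; only the recurrence itself reflects the technique named above.

def top_k_probability(instance_prob, dominating_probs, k):
    # prev[j] = Pr[exactly j of the dominating objects processed so far outrank the instance]
    prev = [1.0] + [0.0] * (k - 1)
    for p in dominating_probs:
        cur = [0.0] * k
        for j in range(k):
            cur[j] = (1.0 - p) * prev[j]        # this dominating object does not outrank the instance
            if j > 0:
                cur[j] += p * prev[j - 1]       # or it does, raising the count by one
        prev = cur
    # the instance is ranked top-k iff it appears and fewer than k objects outrank it
    return instance_prob * sum(prev)

Each dominating object contributes O(k) work to the recurrence, which is the linear cost in k noted above.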
In Figure 6.4(b), as the probability threshold increases, the runtime of the naive method and the deterministic method first increases and then drops when p is greater than 0.8. As indicated by Theorem 6.1, if p is small, we can determine that many streams pass the threshold after checking only a small number of their instances; if p is very large, we can also determine that many streams fail the threshold after checking only a small number of instances.
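This two-sided pruning behavior can be illustrated with running bounds (a simplified sketch of the idea, not the statement of Theorem 6.1): the per-instance contributions already scanned give a lower bound on a stream's top-k probability, and adding the membership probabilities of the unscanned instances gives an upper bound. The decomposition and the function below are our assumptions.

def decide_stream(contributions, membership_probs, p):
    # contributions[i]   : instance i's contribution to the stream's top-k probability,
    #                      scanned in rank order (assumed computable on demand)
    # membership_probs[i]: instance i's membership probability, an upper bound on
    #                      its possible contribution
    lower = 0.0
    slack = sum(membership_probs)          # mass of instances not yet scanned
    for contrib, prob in zip(contributions, membership_probs):
        lower += contrib
        slack -= prob
        if lower >= p:                     # small p: "pass" is decided early
            return True
        if lower + slack < p:              # large p: "fail" is decided early
            return False
    return lower >= p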
In the synthetic data set, the instances of each stream follow a normal distribution.
If the variance of the distribution is larger, then the ranks of instances are more
diverse. Thus, we may have to scan more instances in order to determine whether
the top- k probability of a stream can pass the probability threshold. Figure 6.4(c)
verifies our analysis.
We test how the two parameters φ and ε affect the efficiency of the methods using quantiles. 1/φ is the number of instances kept in a quantile summary. In Figure 6.4(d), we set the value of φ to 0.1, 0.05, 0.033, 0.025, 0.02, and the corresponding number of instances in a quantile summary is 10, 20, 30, 40, 50, respectively. Only the runtime of Det-Q increases when more instances are kept. Since ε is typically very small and does not affect the runtime remarkably, we omit the details here.
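For reference, a quantile summary with 1/φ instances can be built as in the sketch below; we assume an equi-depth summary over the instance scores, which may differ in detail from the construction used in the experiments.

import numpy as np

def quantile_summary(scores, phi):
    # keep ceil(1/phi) representative scores, e.g. phi = 0.1, 0.05, 0.033,
    # 0.025, 0.02 keep roughly 10, 20, 30, 40, 50 instances (Figure 6.4(d))
    m = int(np.ceil(1.0 / phi))
    return np.quantile(scores, np.linspace(0.0, 1.0, m))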
Figure 6.5 compares the precision and the recall of the three approximation algorithms using the same settings as in Figure 6.4. In general, all three methods have good approximation quality, and the sampling method achieves higher precision and recall. We notice that, in Figure 6.5(b), the recall of the deterministic methods using quantiles decreases when the probability threshold increases. This is because a larger probability threshold reduces the size of the answer sets. Moreover, in Figure 6.5(c), the precision and recall of all methods decrease slightly as the variance increases. When the variance gets larger, the instances of a stream are distributed more sparsely. Thus, we may need more samples to capture the distribution.
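Precision and recall here are measured against the exact answer set; a straightforward way to compute them for an approximate answer set is shown below (the function name and interface are ours).

def precision_recall(approx_answer, exact_answer):
    approx, exact = set(approx_answer), set(exact_answer)
    true_pos = len(approx & exact)
    precision = true_pos / len(approx) if approx else 1.0
    recall = true_pos / len(exact) if exact else 1.0
    return precision, recall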
We also test our methods on the uncertain data streams generated using the Gamma distribution Γ(k, θ). The mean μ is randomly picked from a domain [0, 1000]. We change the variance σ² from 10 to 50. The scale parameter θ is set to σ²/μ and the shape parameter k is set to μ²/σ². The efficiency and the approximation quality are shown in Figure 6.6. The results are very similar to those in Figure 6.4(c) and Figure 6.5(c). This shows that the performance of our algorithms is not sensitive to the types of score distributions of the data sets.
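As a worked check of this parameterization (the random seed, stream size, and function name below are our own), recall that for Γ(k, θ) the mean is kθ and the variance is kθ², so θ = σ²/μ and k = μ²/σ² reproduce the intended mean and variance:

import numpy as np

rng = np.random.default_rng(0)

def gamma_stream(num_instances, var):
    mu = rng.uniform(0.0, 1000.0)       # mean picked from [0, 1000]
    theta = var / mu                    # scale parameter: sigma^2 / mu
    k = mu ** 2 / var                   # shape parameter: mu^2 / sigma^2
    return rng.gamma(shape=k, scale=theta, size=num_instances)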
 