Figure 6.4(a) shows that when parameter k increases, the runtime of the naive method and the deterministic method also increases. With a larger k, more instances are likely to be ranked top-k, and thus more instances have to be read before the pruning techniques take effect. Moreover, the Poisson binomial recurrence used by those two methods has a linear complexity with respect to k. However, the deterministic method has a clear advantage over the naive method, which shows the effectiveness of the compatible dominant set technique and of the pruning using the highest possible rank.
The runtime of the sampling methods and the Det-Q method is less sensitive to k, since those techniques incur only a very small increase in overhead as k increases.
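To make the linear dependence on k concrete, the following Python sketch evaluates one instance's top-k probability with a Poisson binomial recurrence. The function name and interface are our own: we assume dominating_probs already holds, for each other object, the probability that it outranks the instance in question; only the recurrence itself reflects the technique named above.

def top_k_probability(instance_prob, dominating_probs, k):
    # prev[j] = Pr[exactly j of the dominating objects processed so far outrank the instance]
    prev = [1.0] + [0.0] * (k - 1)
    for p in dominating_probs:
        cur = [0.0] * k
        for j in range(k):
            cur[j] = (1.0 - p) * prev[j]        # this dominating object does not outrank the instance
            if j > 0:
                cur[j] += p * prev[j - 1]       # or it does, raising the count by one
        prev = cur
    # the instance is ranked top-k iff it appears and fewer than k objects outrank it
    return instance_prob * sum(prev)

Each dominating object contributes O(k) work to the recurrence, which is the linear cost in k noted above.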
In Figure 6.4(b), as the probability threshold increases, the runtime of the naive method and the deterministic method first increases and then drops when p is greater than 0.8. As indicated by Theorem 6.1, if p is small, we can determine that many streams pass the threshold after checking only a small number of their instances; if p is very large, we can also determine that many streams fail the threshold after checking only a small number of instances.
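This two-sided pruning behavior can be illustrated with running bounds (a simplified sketch of the idea, not the statement of Theorem 6.1): the per-instance contributions already scanned give a lower bound on a stream's top-k probability, and adding the membership probabilities of the unscanned instances gives an upper bound. The decomposition and the function below are our assumptions.

def decide_stream(contributions, membership_probs, p):
    # contributions[i]   : instance i's contribution to the stream's top-k probability,
    #                      scanned in rank order (assumed computable on demand)
    # membership_probs[i]: instance i's membership probability, an upper bound on
    #                      its possible contribution
    lower = 0.0
    slack = sum(membership_probs)          # mass of instances not yet scanned
    for contrib, prob in zip(contributions, membership_probs):
        lower += contrib
        slack -= prob
        if lower >= p:                     # small p: "pass" is decided early
            return True
        if lower + slack < p:              # large p: "fail" is decided early
            return False
    return lower >= p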
In the synthetic data set, the instances of each stream follow a normal distribution.
If the variance of the distribution is larger, then the ranks of instances are more
diverse. Thus, we may have to scan more instances in order to determine whether
the top- k probability of a stream can pass the probability threshold. Figure 6.4(c)
verifies our analysis.
We test how the two parameters φ and ε affect the efficiency of the methods using quantiles. 1/φ is the number of instances kept in a quantile summary. In Figure 6.4(d), we set the value of φ to 0.1, 0.05, 0.033, 0.025, 0.02, and the corresponding number of instances in a quantile summary is 10, 20, 30, 40, 50, respectively. Only the runtime of Det-Q increases when more instances are kept. Since ε is typically very small and does not affect the runtime remarkably, we omit the details here.
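For reference, a quantile summary with 1/φ instances can be built as in the sketch below; we assume an equi-depth summary over the instance scores, which may differ in detail from the construction used in the experiments.

import numpy as np

def quantile_summary(scores, phi):
    # keep ceil(1/phi) representative scores, e.g. phi = 0.1, 0.05, 0.033,
    # 0.025, 0.02 keep roughly 10, 20, 30, 40, 50 instances (Figure 6.4(d))
    m = int(np.ceil(1.0 / phi))
    return np.quantile(scores, np.linspace(0.0, 1.0, m))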
Figure 6.5 compares the precision and the recall of the three approximation algorithms using the same settings as in Figure 6.4. In general, all three methods have good approximation quality, and the sampling method achieves higher precision and recall. We notice that, in Figure 6.5(b), the recall of the deterministic methods using quantiles decreases when the probability threshold increases. This is because a larger probability threshold reduces the size of the answer sets. Moreover, in Figure 6.5(c), the precision and recall of all methods decrease slightly as the variance increases. When the variance gets larger, the instances of a stream are distributed more sparsely. Thus, we may need more samples to capture the distribution.
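Precision and recall here are measured against the exact answer set; a straightforward way to compute them for an approximate answer set is shown below (the function name and interface are ours).

def precision_recall(approx_answer, exact_answer):
    approx, exact = set(approx_answer), set(exact_answer)
    true_pos = len(approx & exact)
    precision = true_pos / len(approx) if approx else 1.0
    recall = true_pos / len(exact) if exact else 1.0
    return precision, recall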
We also test our methods on the uncertain data streams generated using the Gamma distribution Γ(k, θ). The mean μ is randomly picked from a domain [0, 1000]. We change the variance σ² from 10 to 50. The scale parameter θ is set to σ²/μ and the shape parameter k is set to μ²/σ². The efficiency and the approximation quality are shown in Figure 6.6. The results are very similar to those in Figure 6.4(c) and Figure 6.5(c). This shows that the performance of our algorithms is not sensitive to the types of score distributions of the data sets.
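As a worked check of this parameterization (the random seed, stream size, and function name below are our own), recall that for Γ(k, θ) the mean is kθ and the variance is kθ², so θ = σ²/μ and k = μ²/σ² reproduce the intended mean and variance:

import numpy as np

rng = np.random.default_rng(0)

def gamma_stream(num_instances, var):
    mu = rng.uniform(0.0, 1000.0)       # mean picked from [0, 1000]
    theta = var / mu                    # scale parameter: sigma^2 / mu
    k = mu ** 2 / var                   # shape parameter: mu^2 / sigma^2
    return rng.gamma(shape=k, scale=theta, size=num_instances)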
 