Fig. 5.8 Scalability: (a) runtime vs. cardinality (number of tuples in the data set); (b) runtime vs. number of rules. Curves shown: RC, RC+AR, RC+LR, and Sampling.
As discussed in Section 5.2.3, a tuple t failing the probability threshold still has to be retrieved if some tuples ranked lower than t may satisfy the threshold.
Figure 5.4 verifies the effectiveness of the pruning techniques discussed in Section 5.2.3. With the pruning techniques, the exact algorithm accesses only a small portion of the tuples in the data set. Interestingly, the average sample length is close to the number of tuples scanned by the exact algorithm, which verifies the effectiveness of our sampling techniques. Moreover, both the exact algorithm and the sampling algorithm access fewer tuples than the number given by the general stopping condition, and that number is close to the real stopping point, which shows the effectiveness of the stopping condition.
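To make the role of the stopping condition concrete, the following is a minimal sketch of one such check, assuming independent tuples scanned in ranking order and ignoring generation rules; the function name and details are illustrative rather than the exact condition used in the experiments. The scan stops once the probability that fewer than k of the scanned tuples appear drops below the threshold p, since the top-k probability of any unseen tuple cannot exceed that quantity.

```python
def scan_with_stopping_condition(tuples, k, p_threshold):
    """Scan tuples in ranking order until no unseen tuple can reach
    top-k probability >= p_threshold (illustrative sketch only).

    tuples: list of (tuple_id, membership_prob), sorted by ranking order.
    """
    # dp[j] = Pr(exactly j of the scanned tuples appear), for j = 0 .. k-1
    dp = [1.0] + [0.0] * (k - 1)
    scanned = []
    for t_id, prob in tuples:
        scanned.append(t_id)
        new = [0.0] * k
        for j in range(k):
            new[j] += dp[j] * (1.0 - prob)      # this tuple is absent
            if j + 1 < k:
                new[j + 1] += dp[j] * prob      # this tuple is present
        dp = new
        # Pr(fewer than k scanned tuples appear) bounds the top-k
        # probability of every unseen (lower-ranked) tuple from above.
        if sum(dp) < p_threshold:
            break
    return scanned
```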
5.6.2.2 Efficiency and Approximation Quality
Figure 5.5 compares the runtime of the three versions of the exact algorithm and the sampling algorithm with respect to the four aspects tested in Figure 5.4. The runtime of the Poisson approximation based method is always less than one second, so we omit it from Figure 5.5 for readability. We also count the number of times that subset probability values are computed in the three versions of the exact algorithm. The trends are exactly the same as those of the runtime; therefore, we omit those figures here. The results confirm that the rule-tuple compression technique and the reordering techniques speed up the exact algorithm substantially, and that lazy reordering always outperforms aggressive reordering.
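The idea behind rule-tuple compression can be sketched briefly: tuples in the same multi-tuple generation rule are mutually exclusive, so the already-scanned tuples of a rule can be replaced, for subset probability computation, by a single "rule-tuple" whose membership probability is the sum of theirs. The sketch below is a minimal illustration with hypothetical names, assuming exclusion rules only; the aggressive and lazy reordering heuristics are not shown. The shorter the compressed list, the fewer steps each subset probability computation takes, which is why compression speeds up the exact algorithm.

```python
def compress_rules(scanned, rule_of):
    """Replace the scanned tuples of each exclusion rule by one rule-tuple.

    scanned: list of (tuple_id, membership_prob) in ranking order
    rule_of: dict mapping tuple_id -> rule_id (tuples without a rule are
             treated as independent)
    Returns the membership probabilities after compression.
    """
    compressed = []      # probabilities of independent tuples
    rule_mass = {}       # rule_id -> summed probability of its scanned tuples
    for t_id, prob in scanned:
        rule = rule_of.get(t_id)
        if rule is None:
            compressed.append(prob)
        else:
            # mutually exclusive tuples: probabilities simply add up
            rule_mass[rule] = rule_mass.get(rule, 0.0) + prob
    compressed.extend(rule_mass.values())
    return compressed

# Example: t2 and t4 belong to the same rule r1, so they collapse into a
# single rule-tuple with probability 0.3 + 0.5 = 0.8:
# compress_rules([("t1", 0.9), ("t2", 0.3), ("t3", 0.6), ("t4", 0.5)],
#                {"t2": "r1", "t4": "r1"})  ->  [0.9, 0.6, 0.8]
```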
Compared to the exact algorithm, the sampling method is generally more stable in runtime. Interestingly, the exact algorithm (Exact+LR) and the sampling algorithm each have their own advantages. For example, when k is small, the exact algorithm is faster, whereas the sampling method wins when k is large. As k increases, more tuples need to be scanned by the exact algorithm, and those tuples may be revisited during subset probability computation. In contrast, the only overhead in the sampling method is scanning more tuples when generating a sample unit, which is linear in k. This justifies the need for both the exact algorithm and the sampling algorithm.
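A small sketch also makes the linear-in-k overhead of the sampling method visible; the following is a minimal illustration under simplifying assumptions (independent tuples sorted in ranking order, generation rules ignored, illustrative names). Each sample unit is drawn by scanning from the top and including every tuple with its membership probability, and the scan stops as soon as k included tuples have been collected, so the work per sample unit grows roughly linearly with k.

```python
import random

def draw_sample_unit(tuples, k, rng=random):
    """Draw one sample unit: the top-k tuples of one sampled possible world.

    tuples: list of (tuple_id, membership_prob) sorted by ranking order.
    """
    unit = []
    for t_id, prob in tuples:
        if rng.random() < prob:        # tuple appears in this possible world
            unit.append(t_id)
            if len(unit) == k:         # no later tuple can enter the top-k
                break
    return unit

def estimate_topk_probabilities(tuples, k, num_units=1000, rng=random):
    """Monte Carlo estimate of each tuple's top-k probability."""
    counts = {t_id: 0 for t_id, _ in tuples}
    for _ in range(num_units):
        for t_id in draw_sample_unit(tuples, k, rng):
            counts[t_id] += 1
    return {t_id: c / num_units for t_id, c in counts.items()}
```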
 