Database Reference
In-Depth Information
Figure 5.6 compares the precision and the recall of the sampling method and the
Poisson approximation based method. The sampling method achieves better results
in general. However, the precision and the recall of the Poisson approximation based
method is always higher than 85% with the runtime less than one second. Thus, it is
a good choice when the efficiency is a concern.
The recall of the Poisson approximation based method increases significantly
when the query parameter k increases. As indicated in [185], the Poisson distri-
bution approximates the Poisson binomial distribution well when the number of
Poisson trials is large. When the parameter k increases, more tuples are read before
the stopping condition is satisfied. Thus, the Poisson approximation based method
provides better approximation for the top- k probability values.
Figure 5.7(a) tests the average error rate of the top- k probability approximation
using the sampling method. Suppose the top- k probability of tuple t is Pr k
(
t
)
, and
the top- k probability estimated by the sampling method is Pr k
(
t
)
, the average error
) Pr k
rate is defined as Pr k ( t ) > p | Pr k
Pr k
(
t
(
t
) |/
(
t
)
. For reference, we also plot the error
bound calculated from the Chernoff-Hoeffding bound [182] given the sample size.
We can clearly see that the error rate of the sampling method in practice is much
better than the theoretical upper bound.
Moreover, Figure 5.7(b) shows the precision and recall of the sampling method.
The precision is the percentage of tuples returned by the sampling method that are in
the actual top- k list returned by the exact algorithm. The recall is the percentage of
tuples returned by the exact method that are also returned by the sampling method.
The results show that the sampling method only needs to draw a small number of
samples to achieve good precision and recall. With a larger k value, more samples
have to be drawn to achieve the same quality
Pr k
|{
t
|
(
t
) >
p
}|
5.6.2.3 Scalability
Last, Figure 5.8 shows the scalability of the exact algorithm and the sampling algo-
rithm. In Figure 5.8(a), we vary the number of tuples from 20
000, and
set the number of multi-tuple rules to 10% of the number of tuples. We set k
,
000 to 100
,
=
200
and p
3. The runtime increases mildly when the database size increases. Due to
the pruning rules and the improvement on extracting sample units, the scan depth
(i.e., the number of tuples read) in the exact algorithm and the sampling algorithm
mainly depends on k and is insensible to the total number of tuples in the data set.
In Figure 5.8(b), we fix the number of tuples to 20
=
0
.
,
000, and vary the number of
,
rules from 500 to 2
500. The runtime of the algorithms increases since more rules
lead to smaller tuple probabilities and more scans tuples back and forth in the span
of rules. However, the reordering techniques can handle the rule complexity nicely,
and make Exact+AR and Exact+LR scalable.
In all the above situations, the runtime of the Poisson approximation based
method is insensitive to those factors, and remains within 1 second.
Search WWH ::




Custom Search