Probabilistic Ranking Queries on Uncertain Data - Ranking Queries on Uncertain Data

Database Reference

In-Depth Information

Figure 5.6 compares the precision and the recall of the sampling method and the

Poisson approximation based method. The sampling method achieves better results

in general. However, the precision and the recall of the Poisson approximation based

method is always higher than 85% with the runtime less than one second. Thus, it is

a good choice when the efficiency is a concern.

The recall of the Poisson approximation based method increases significantly

when the query parameter k increases. As indicated in [185], the Poisson distri-

bution approximates the Poisson binomial distribution well when the number of

Poisson trials is large. When the parameter k increases, more tuples are read before

the stopping condition is satisfied. Thus, the Poisson approximation based method

provides better approximation for the top- k probability values.

Figure 5.7(a) tests the average error rate of the top- k probability approximation

using the sampling method. Suppose the top- k probability of tuple t is Pr k

(

)

, and

the top- k probability estimated by the sampling method is Pr k

(

)

, the average error

) − Pr k

rate is defined as ∑ Pr k ( t ) > p | Pr k

Pr k

(

) |/

(

)

. For reference, we also plot the error

bound calculated from the Chernoff-Hoeffding bound [182] given the sample size.

We can clearly see that the error rate of the sampling method in practice is much

better than the theoretical upper bound.

Moreover, Figure 5.7(b) shows the precision and recall of the sampling method.

The precision is the percentage of tuples returned by the sampling method that are in

the actual top- k list returned by the exact algorithm. The recall is the percentage of

tuples returned by the exact method that are also returned by the sampling method.

The results show that the sampling method only needs to draw a small number of

samples to achieve good precision and recall. With a larger k value, more samples

have to be drawn to achieve the same quality

Pr k

(

) >

5.6.2.3 Scalability

Last, Figure 5.8 shows the scalability of the exact algorithm and the sampling algo-

rithm. In Figure 5.8(a), we vary the number of tuples from 20

000, and

set the number of multi-tuple rules to 10% of the number of tuples. We set k

000 to 100

200

and p

3. The runtime increases mildly when the database size increases. Due to

the pruning rules and the improvement on extracting sample units, the scan depth

(i.e., the number of tuples read) in the exact algorithm and the sampling algorithm

mainly depends on k and is insensible to the total number of tuples in the data set.

In Figure 5.8(b), we fix the number of tuples to 20

000, and vary the number of

rules from 500 to 2

500. The runtime of the algorithms increases since more rules

lead to smaller tuple probabilities and more scans tuples back and forth in the span

of rules. However, the reordering techniques can handle the rule complexity nicely,

and make Exact+AR and Exact+LR scalable.

In all the above situations, the runtime of the Poisson approximation based

method is insensitive to those factors, and remains within 1 second.

Ranking Queries on Uncertain Data

Search WWH ::

Custom Search

Home