Top-k Typicality Queries on Uncertain Data - Ranking Queries on Uncertain Data

Database Reference

In-Depth Information

rounds of validations, the quality of randomized tournament may increase, but after

3 rounds, the quality improvement is very small.

Figure 4.4(a) shows the change of approximation quality with respect to the

neighborhood threshold. In the figure, the error bounds given by Theorems 4.1

and 4.3 are also plotted, which are labeled as UB

, respec-

tively. To make the curves legible, the error rates are in the logarithmic scale.

Clearly, the larger the neighborhood threshold, the more accurate the local typical-

ity approximation. Our methods perform much better than the error bounds, which

shows that they are effective in practice.

In Figure 4.4(b), we vary the value of k in the top- k queries. The approximation

quality of RT is not sensitive to k , since it runs k times to select the top- k answers.

Both DLTA and LT3 see a larger error rate with a larger value of k , this is because

the distant neighbors may get a better chance to play a role in typicality when more

instances are returned.

Figure 4.4(c) shows the impact of dimensionality on the error rate. DLTA

achieves the best approximation quality, the error rate is up to 0

(

DLTA

)

and UB

(

LT 3

)

066%. LT3 has

an accuracy close to DLTA, and is much better than RT. The error rate decreases

as the dimensionality increases, since the average pairwise distance among the in-

stances in the object also increases and the local typicality approximation becomes

more effective.

Figure 4.4(d) tests the approximation quality versus the number of instances.

When the cardinality increases, the instances becomes denser, and the local typical-

ity approximation is more accurate. That is why LT3 and DLTA perform better with

larger number of instances. However, the approximation quality of RT decreases in

large number of instances, since with a fixed tournament group size, the larger the

number of instances, the more likely the most typical instance in a random group is

biased.

In summary, DLTA and LT3 both achieve better accuracy than RT, which strongly

justifies the effectiveness of our local typicality approximation technique. The accu-

racy of LT3 is slightly lower than DLTA, but as we will show in Section 4.5.4, LT3

is much more efficient than DLTA.

We also report the approximation quality of top- k discriminative typicality query

answering algorithms. By default, the data contains 10

000 instances with 5 at-

tributes. We randomly assign 20% of the instances into the target object O , and

conduct top-10 discriminative typicality queries. The neighborhood threshold

DLTA and LT3 is set to 2 h , where h is the bandwidth of Gaussian kernels. The group

size of randomized tournament is set to 50, and 4 validations are conducted. Here we

increase the tournament size to 50, since only 20% instances are actually involved

in the tournament, as we explained in Section 4.3.1.

The error rate measure is defined as follows. We normalize DT

(

−

)

DT (

−

(

−

) −

min x ∈ C DT

(

−

)

, in order to make the

value always non-negative. For a top- k discriminative typicality

query Q , let A be the set of k instances returned by the exact algorithm, and A be the

set of k instances returned by an approximation algorithm. Then, the error rate e is

(

−

)

Ranking Queries on Uncertain Data

Search WWH ::

Custom Search

Home