Top-k Typicality Queries on Uncertain Data - Ranking Queries on Uncertain Data


\[
e = \frac{\sum_{o \in A} DT(o, C, D - C) - \sum_{o \in \widetilde{A}} DT(o, C, D - C)}{\sum_{o \in A} DT(o, C, D - C)} \times 100\% \tag{4.17}
\]

The approximation quality of the three methods is shown in Figure 4.5. In general, the comparison among RT, DLTA and LT3 is similar to that for top-k simple typicality query evaluation shown in Figure 4.4.

To test the approximation quality for representative typicality, we conducted various experiments. By default, the data set contains 5,000 instances with 5 attributes, and we conduct top-10 representative typicality queries. The neighborhood threshold of DLTA and LT3 is set to 2h, where h is the bandwidth of the Gaussian kernels. The group size of the randomized tournament method (RT) is set to 10, and 4 validations are conducted.

We adopt the following error rate measure. For a top-k representative typicality query Q, let A be the set of k instances returned by the exact algorithm, and Ã be the set of k instances returned by an approximation algorithm. GT(A, S) and GT(Ã, S) are the group typicality scores of A and Ã, respectively. Then, the error rate e is

\[
e = \frac{\left| GT(A, S) - GT(\widetilde{A}, S) \right|}{GT(A, S)} \times 100\% \tag{4.18}
\]

The error rate measure computes the relative difference between the group typicality of the exact answer and the group typicality of the approximate answer. If the error rate is small, then even if the instances in the two answer sets are different, the approximate answer still represents the whole data set well. The approximation quality for representative typicality queries is shown in Figure 4.6. The explanations are similar to those for simple typicality queries.
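The error rate of Equation 4.18 can be sketched in a few lines of Python. This is only an illustration: the function name `group_error_rate` and the two score values in the usage lines are hypothetical, and the group typicality scores are assumed to have been computed elsewhere.

```python
def group_error_rate(gt_exact: float, gt_approx: float) -> float:
    """Relative difference between two group typicality scores, in percent.

    gt_exact  -- group typicality of the exact answer set
    gt_approx -- group typicality of the approximate answer set
    Both values are assumed to be precomputed; this helper is hypothetical.
    """
    if gt_exact <= 0:
        raise ValueError("the exact group typicality must be positive")
    return abs(gt_exact - gt_approx) / gt_exact * 100.0


# Made-up scores: the two answer sets may contain different instances,
# yet a small error rate means they represent the data almost equally well.
example_error = group_error_rate(0.40, 0.38)
```

Because the measure depends only on the two scores, answer sets with no instances in common can still yield an error rate close to zero.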

In summary, for all three types of typicality queries, DLTA has the best approximation quality, while RT gives the largest error rates. LT3 has approximation quality comparable to that of DLTA.

4.5.3 Sensitivity to Parameters and Noise

To test the sensitivity of the answers to top-k typicality queries with respect to the kernel function and the bandwidth value, we use the Quadruped Animal Data Generator from the UCI Machine Learning Database Repository to generate synthetic data sets with 10,000 instances and 5 attributes.

We first fix the bandwidth value $h = \frac{1.06\,s}{\sqrt[5]{n}}$ as discussed in Section 4.1.1, and use the kernel functions listed in Table 4.1 to answer top-k simple typicality, discriminative typicality, and representative typicality queries. We compare the results computed using the Gaussian kernel function and the results computed using some other kernel functions as follows.
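As a rough illustration of this setup, the sketch below fixes the bandwidth with the rule of thumb h = 1.06 s / n^(1/5) and ranks instances under two kernels. Several things here are assumptions rather than the authors' implementation: the data are one-dimensional toy values, s is taken to be the sample standard deviation, an instance's score is taken to be its kernel density estimate, and the Epanechnikov kernel stands in for "some other kernel" since Table 4.1 is not reproduced here. All function names are hypothetical.

```python
import math
import statistics


def rule_of_thumb_bandwidth(values):
    """h = 1.06 * s / n^(1/5), with s assumed to be the sample std. deviation."""
    n = len(values)
    s = statistics.stdev(values)
    return 1.06 * s / n ** 0.2


def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def epanechnikov_kernel(u):
    """Epanechnikov kernel (assumed alternative; zero outside [-1, 1])."""
    return 0.75 * (1.0 - u * u) if abs(u) <= 1.0 else 0.0


def kernel_density(x, values, h, kernel):
    """Kernel density estimate at x, used here as the instance's score."""
    return sum(kernel((x - v) / h) for v in values) / (len(values) * h)


def top_k(values, h, kernel, k):
    """The k instances with the highest estimated density."""
    ranked = sorted(
        values,
        key=lambda x: kernel_density(x, values, h, kernel),
        reverse=True,
    )
    return ranked[:k]


# Toy data: a dense cluster near 1.0 plus a few scattered values.
data = [1.0, 1.2, 1.1, 0.9, 1.3, 5.0, 5.2, 9.0]
h = rule_of_thumb_bandwidth(data)
gauss_top = top_k(data, h, gaussian_kernel, 3)
epan_top = top_k(data, h, epanechnikov_kernel, 3)
```

On this toy data both kernels pick their top 3 instances from the dense cluster, which is the kind of agreement the error rates are meant to quantify.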

Let the results returned by using the Gaussian kernel be A and the results returned by using other kernel functions be Ã. The error rates of the answers to the three typicality queries are computed using Equations 4.16, 4.17 and 4.18, respectively. The curves are shown in Figure 4.7. The results match the discussion in Section 4.1.1:
