Top-k Typicality Queries on Uncertain Data - Ranking Queries on Uncertain Data

Database Reference

In-Depth Information

the answers computed using some other kernel functions are similar to the answers

computed using the Gaussian kernel, since the error rates are very low.

We then use the Gaussian kernel function and vary the bandwidth value from 0

.

5 h

0 6 s

5 √ n is the default bandwidth value used in other experiments. Let

A be the answer set computed using the default bandwidth value h and A be the an-

swer set computed using other bandwidth values, the error rates are computed using

Equations 4.16, 4.17 and 4.18, respectively. From the results shown in Figure 4.8,

we can see that the answers computed using different bandwidth values are similar

in their typicality/discriminative typicality/group typicality score. Moreover, using

smaller bandwidth values causes less difference than using larger bandwidth values.

Larger bandwidth values smooth out the peaks of the density curves, which are the

most typical points.

In summary, the answers to top- k simple typicality queries, top- k discriminative

typicality queries and top- k representative typicality queries are insensitive to the

choice of kernel functions and the bandwidth values.

Moreover, we evaluate the sensitivity of top- k typicality queries with respect to

noise in data sets. We use the Quadraped Animal Data Generator to generate syn-

thetic data sets with 10

1

.

to 2 h , where h

=

000 instances and 5 attributes. In addition, we add 5% to

15% noise instances whose attribute values are uniformly distributed in the domain

of each attribute. Gaussian kernel function and bandwidth h

,

0 6 s

5 √ n are used. The

answers returned are denoted by A , and the answers computed when removing the

noises are denoted by A . The error rates in Figure 4.9(a), 4.9(b), and 4.9(c) are com-

puted using Equations 4.16, 4.17 and 4.18, respectively. Clearly, the results of top- k

typicality queries are not sensitive to noise and outliers in data sets.

1

.

=

4.5.4 Efficiency and Scalability

To test the efficiency and the scalability of our methods, we report in Figures 4.10, 4.11,

and 4.12 the runtime in the experiments conducted in Figures 4.4, 4.5, and 4.6, re-

spectively.

As shown in Figure 4.10(a), the runtime of DLTA increases substantially when

the neighborhood threshold increases, but the increase of runtime for the LT3

method is mild, thanks to the tournament mechanism.

Figure 4.10(b) shows that the runtime of DLTA and LT3 is insensitive to the in-

crease of k . LT3 incrementally computes other top- k answers after the top-1 answer

is computed. Thus, computing more answers only takes minor cost. The RT method

has to run the tournaments k rounds, and thus the cost is linear to k .

As shown in Figure 4.10(c), among the four methods, RT is the fastest and the

exact algorithm is the slowest. LT3 and DLTA are in between, and LT3 is faster than

DLTA. All methods are linearly scalable with respect to dimensionality.

Ranking Queries on Uncertain Data

Search WWH ::

Custom Search

Home