Database Reference
In-Depth Information
the answers computed using some other kernel functions are similar to the answers
computed using the Gaussian kernel, since the error rates are very low.
We then use the Gaussian kernel function and vary the bandwidth value from 0
.
5 h
0 6 s
5 n is the default bandwidth value used in other experiments. Let
A be the answer set computed using the default bandwidth value h and A be the an-
swer set computed using other bandwidth values, the error rates are computed using
Equations 4.16, 4.17 and 4.18, respectively. From the results shown in Figure 4.8,
we can see that the answers computed using different bandwidth values are similar
in their typicality/discriminative typicality/group typicality score. Moreover, using
smaller bandwidth values causes less difference than using larger bandwidth values.
Larger bandwidth values smooth out the peaks of the density curves, which are the
most typical points.
In summary, the answers to top- k simple typicality queries, top- k discriminative
typicality queries and top- k representative typicality queries are insensitive to the
choice of kernel functions and the bandwidth values.
Moreover, we evaluate the sensitivity of top- k typicality queries with respect to
noise in data sets. We use the Quadraped Animal Data Generator to generate syn-
thetic data sets with 10
1
.
to 2 h , where h
=
000 instances and 5 attributes. In addition, we add 5% to
15% noise instances whose attribute values are uniformly distributed in the domain
of each attribute. Gaussian kernel function and bandwidth h
,
0 6 s
5 n are used. The
answers returned are denoted by A , and the answers computed when removing the
noises are denoted by A . The error rates in Figure 4.9(a), 4.9(b), and 4.9(c) are com-
puted using Equations 4.16, 4.17 and 4.18, respectively. Clearly, the results of top- k
typicality queries are not sensitive to noise and outliers in data sets.
1
.
=
4.5.4 Efficiency and Scalability
To test the efficiency and the scalability of our methods, we report in Figures 4.10, 4.11,
and 4.12 the runtime in the experiments conducted in Figures 4.4, 4.5, and 4.6, re-
spectively.
As shown in Figure 4.10(a), the runtime of DLTA increases substantially when
the neighborhood threshold increases, but the increase of runtime for the LT3
method is mild, thanks to the tournament mechanism.
Figure 4.10(b) shows that the runtime of DLTA and LT3 is insensitive to the in-
crease of k . LT3 incrementally computes other top- k answers after the top-1 answer
is computed. Thus, computing more answers only takes minor cost. The RT method
has to run the tournaments k rounds, and thus the cost is linear to k .
As shown in Figure 4.10(c), among the four methods, RT is the fastest and the
exact algorithm is the slowest. LT3 and DLTA are in between, and LT3 is faster than
DLTA. All methods are linearly scalable with respect to dimensionality.
 
Search WWH ::




Custom Search