$$e = \frac{\sum_{o \in A} DT(o, O, S - O) - \sum_{o \in \tilde{A}} DT(o, O, S - O)}{\sum_{o \in A} DT(o, O, S - O)} \times 100\% \qquad (4.17)$$
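As a minimal sketch of the error rate in Equation 4.17, the following Python function assumes that the discriminative typicality scores DT(o, O, S − O) have already been computed and stored in a dictionary keyed by instance identifier. It reads the measure as the relative drop in summed discriminative typicality from the exact answer set to the approximate one. The names `dt_error_rate`, `dt_scores`, `exact` and `approx`, as well as the sample scores, are hypothetical and serve only as illustration.

```python
def dt_error_rate(dt_scores, exact, approx):
    """Relative difference, in percent, between the summed DT scores
    of the exact answer set and of the approximate answer set."""
    exact_total = sum(dt_scores[o] for o in exact)
    approx_total = sum(dt_scores[o] for o in approx)
    if exact_total == 0:
        # The ratio is undefined when the exact answer sums to zero.
        raise ValueError("exact answer has zero total typicality")
    return (exact_total - approx_total) / exact_total * 100.0


# Hypothetical, made-up scores for five instances (not from the experiments).
dt_scores = {"a": 0.40, "b": 0.30, "c": 0.20, "d": 0.10, "e": 0.05}
exact = ["a", "b"]    # top-2 answer of an exact algorithm
approx = ["a", "c"]   # top-2 answer of an approximation algorithm

print(dt_error_rate(dt_scores, exact, approx))
```

When both algorithms return the same set the function returns 0, and the value grows as the approximate answer loses total discriminative typicality relative to the exact one.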
The approximation quality of the three methods is shown in Figure 4.5. In general,
the comparison among RT, DLTA and LT3 is similar to that for the top-k simple
typicality query evaluation shown in Figure 4.4.
To test the approximation quality for representative typicality, we conducted
various experiments. By default, the data set contains 5,000 instances with 5
attributes, and we conduct top-10 representative typicality queries. The
neighborhood threshold σ of DLTA and LT3 is set to 2h, where h is the bandwidth of
the Gaussian kernels. The group size of the randomized tournament is set to 10, and
4 validations are conducted.
We adopt the following error rate measure. For a top-k representative typicality
query Q, let A be the set of k instances returned by the exact algorithm, and Ã be
the set of k instances returned by an approximation algorithm. GT(A, O) and
GT(Ã, O) are the group typicality scores of A and Ã, respectively. Then, the error
rate e is

$$e = \frac{|GT(A, O) - GT(\tilde{A}, O)|}{GT(A, O)} \times 100\% \qquad (4.18)$$
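Equation 4.18 can be sketched in a few lines of Python. The sketch assumes the two group typicality scores have already been computed elsewhere; it does not show how GT itself is evaluated. The function name `gt_error_rate` and the sample scores are hypothetical.

```python
def gt_error_rate(gt_exact, gt_approx):
    """Absolute relative difference, in percent, between the group
    typicality of the exact answer and of the approximate answer."""
    if gt_exact == 0:
        # The ratio is undefined when the exact answer has zero group typicality.
        raise ValueError("exact answer has zero group typicality")
    return abs(gt_exact - gt_approx) / gt_exact * 100.0


# Hypothetical scores: the two answer sets may contain different instances,
# yet their group typicality scores can still be close.
print(gt_error_rate(0.80, 0.76))  # about 5 (percent)
```

Because only the two scores enter the measure, two answer sets with no instance in common can still yield an error rate near zero.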
The error rate measure computes the relative difference between the group
typicality of the exact answer and that of the approximate answer. If the error rate
is small, then even if the instances in the two answer sets are different, the
approximate answer still represents the whole data set well. The approximation
quality of representative typicality approximation is shown in Figure 4.6. The
explanation is similar to that for simple typicality queries.
In summary, for all three types of typicality queries, DLTA has the best
approximation quality, while RT gives the largest error rates. LT3 has comparable
approximation quality to DLTA.
4.5.3 Sensitivity to Parameters and Noise
To test the sensitivity of the answers to top-k typicality queries with respect to
the kernel function and the bandwidth value, we use the Quadraped Animal Data
Generator from the UCI Machine Learning Database Repository to generate synthetic
data sets with 10,000 instances and 5 attributes.
We first fix the bandwidth value $h = 1.06\,s / \sqrt[5]{n}$ as discussed in
Section 4.1.1, and use the kernel functions listed in Table 4.1 to answer top-k
simple typicality, discriminative typicality and representative typicality queries.
We compare the results computed using the Gaussian kernel function with the results
computed using the other kernel functions as follows.
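The setup above can be illustrated with a small Python sketch. It makes several assumptions: the bandwidth rule is read as h = 1.06 · s · n^(−1/5) with s the sample standard deviation, the data are one-dimensional, and a plain kernel density estimate stands in for the typicality score. The Epanechnikov kernel is used only as a familiar alternative to the Gaussian kernel and is not claimed to be one of the kernels in Table 4.1. All function names and the sample values are hypothetical.

```python
import math
import statistics


def rule_of_thumb_bandwidth(sample):
    """h = 1.06 * s * n^(-1/5), with s the sample standard deviation."""
    n = len(sample)
    s = statistics.stdev(sample)
    return 1.06 * s * n ** (-1 / 5)


def gaussian_kernel(u):
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)


def epanechnikov_kernel(u):
    return 0.75 * (1 - u * u) if abs(u) <= 1 else 0.0


def kernel_density(x, sample, h, kernel):
    """Kernel density estimate at x, used here as a stand-in typicality score."""
    return sum(kernel((x - xi) / h) for xi in sample) / (len(sample) * h)


def top_k(sample, h, kernel, k):
    """The k sample points with the highest estimated density."""
    ranked = sorted(
        sample,
        key=lambda x: kernel_density(x, sample, h, kernel),
        reverse=True,
    )
    return ranked[:k]


# Made-up one-dimensional sample with one clear outlier at 5.0.
sample = [1.0, 1.2, 1.3, 1.9, 2.0, 2.1, 2.2, 5.0]
h = rule_of_thumb_bandwidth(sample)

print(top_k(sample, h, gaussian_kernel, 3))
print(top_k(sample, h, epanechnikov_kernel, 3))
```

Holding h fixed and swapping only the kernel argument mirrors the experiment described above: any difference between the two printed answers is attributable to the kernel function alone.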
Let A be the results returned using the Gaussian kernel and Ã be the results
returned using another kernel function. The error rates of the answers to the three
typicality queries are computed using Equations 4.16, 4.17 and 4.18, respectively.
The curves are shown in Figure 4.7. The results match the discussion in Section 4.1.1: