Database Reference
In-Depth Information
Proof. The theorem can be proved directly using a special form of the Chernoff-
Hoeffding bound due to Angluin and Valiant [181].
Theorem 4.4 provides an upper bound of the sample size, which is independent
of the size of data sets. The larger
ε
and
δ
, the smaller the sample size. The larger
σ
, the larger the sample size.
Using the sampling method suggested by Theorem 4.4, we can have a tourna-
ment algorithm using an LT-tree of cost O
( |
|
|
| )
O
log
O
. The algorithm provides a
theoretical bound on the runtime.
However, the sampling method cannot be practically gainful unless on extremely
large data sets. The LT-tree already exploits the locality of instances nicely. When
the data set is not extremely large, the number of instances in the
-neighborhood
of a node is usually (substantially) smaller than the number of samples required for
high approximation quality. In our experiments, the above case is always true. Thus,
we do not include the experimental results on this sampling method.
σ
4.3 Answering Discriminative Typicality Queries
According to Definition 2.6, discriminative typicality can be calculated as follows.
Given two uncertain objects O and S , where O is the target uncertain object, for each
instance o
O , Algorithm 4.1 can be used to compute the simple typicality scores
of o in O and S , respectively. The difference between the two is the discriminative
typicality of o .
Suppose there are m instances in the target object O and n instances in S . To com-
pute the discriminative typicality score of an instance o
O , we have to compute the
simple typicality scores of o in both O and S , which takes O
time. Therefore,
answering top- k discriminative typicality queries using the exact algorithm takes
time O
(
n
+
m
)
(
(
+
))
.
The approximation methods developed for top- k simple typicality queries can
also be adopted to answer top- k discriminative typicality queries.
m
m
n
4.3.1 A Randomized Tournament Algorithm
Generally, the randomized tournament algorithm can be used to answer top- k dis-
criminative typicality queries if the discriminative typicality measure is applied. In
contrast to the randomized tournament algorithm for top- k simple typicality queries,
only the instances in the target object O are involved in the tournament in order to
answer top- k discriminative typicality queries. The other instances are only used to
compute the approximate discriminative typicality scores of the instances in O .
The cost of discriminative typicality computation within one group is O
m
m + n t 2
(
)
.
m + n
t
Since there are O
(
)
groups in total, the complexity of selecting the final winner
 
Search WWH ::




Custom Search