4.1 Answering Simple Typicality Queries
Consider an uncertain object O. The simple typicality value of an instance o ∈ O is the likelihood of o given that o is a sample of X, the underlying random variable generating the samples in O (Definition 2.5).
In this section, we first discuss how to compute likelihood values. Then, we show that the complexity of answering top-k typicality queries is quadratic. Last, we present a randomized tournament approximation algorithm (RT). The approximation algorithm developed for simple typicality computation in this section can be extended to answer top-k discriminative typicality queries and top-k representative typicality queries, as will be discussed in Sections 4.3 and 4.4, respectively.
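To make the quadratic complexity concrete, the following minimal Python sketch of the exact approach scores every instance by a kernel-estimated likelihood that touches all n instances, for O(n^2) total work. It is our own illustration rather than the book's code: the name top_k_simple_typicality is hypothetical, the instances are assumed one-dimensional, and the Gaussian kernel with bandwidth h anticipates Section 4.1.1.

import math

def top_k_simple_typicality(instances, k, h):
    """Exact top-k simple typicality: score each instance by its
    kernel-estimated likelihood over all n instances (O(n) work per
    instance, hence O(n^2) in total), then keep the k best."""
    n = len(instances)

    def likelihood(o):
        # Gaussian kernel estimate of L(o | O); see Section 4.1.1.
        return sum(math.exp(-0.5 * ((o - x) / h) ** 2)
                   for x in instances) / (n * h * math.sqrt(2.0 * math.pi))

    return sorted(instances, key=likelihood, reverse=True)[:k]

For example, with instances [1.0, 1.1, 1.2, 5.0] and h = 0.5, the isolated point 5.0 receives almost no kernel weight from the others, so the top-1 answer comes from the dense cluster around 1.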
4.1.1 Likelihood Computation
For an instance o in an uncertain object O, the likelihood L(o | O) is the posterior probability of o given the instances in O, which can be computed using probability density estimation methods. There are several model estimation techniques in the literature [164], including parametric and non-parametric density estimation. Parametric
density estimation requires a certain distribution assumption, while non-parametric
estimation does not. Among the various techniques proposed for non-parametric
density estimation [165], histogram estimation [166], kernel estimation [167, 168]
and nearest neighbor estimation [169] are the most popular. In this work, we use kernel estimation, because it can estimate unknown data distributions effectively [170].
Kernel estimation is a generalization of sampling. In random sampling, each sample point carries a unit weight. However, an observation of a sample point increases the chance of observing other points nearby. Therefore, a kernel estimator distributes the weight of each point over the space around it according to a kernel function K. The commonly used kernel functions are listed in Table 4.1, where I(|u| ≤ 1) in the kernel functions denotes the value 1 when |u| ≤ 1 holds, and 0 when it does not hold.
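As a concrete illustration of the estimator, here is a minimal Python sketch (our own code; the names gaussian_kernel and kernel_likelihood are hypothetical, and the instances are assumed one-dimensional):

import math

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_likelihood(o, instances, h):
    """Kernel density estimate of L(o | O) with bandwidth h: every
    instance oi spreads a unit weight around itself through K, and
    the estimate averages the weight that reaches the point o."""
    n = len(instances)
    return sum(gaussian_kernel((o - oi) / h) for oi in instances) / (n * h)

Instances close to many others accumulate weight from all of them and thus receive high likelihood, which is exactly what simple typicality rewards.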
A bandwidth parameter (also known as the smoothing parameter) h is used to control how the weight is spread over the neighborhood of a sample. As shown in [171], the quality of the kernel estimation depends mostly on the bandwidth h and only lightly on the choice of the kernel K. Too small a bandwidth produces very spiky curves, while too large a bandwidth smooths out details. A class of effective methods is data-driven least-squares cross-validation algorithms [172, 173, 174, 175], which select the bandwidth value that minimizes the integrated squared error.
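A minimal sketch of least-squares cross-validation for a Gaussian kernel estimator follows (our own illustration; the cited algorithms [172, 173, 174, 175] are more refined). Up to a constant that does not depend on h, the integrated squared error can be estimated by CV(h) = ∫ f̂(x)^2 dx − (2/n) Σ_i f̂_{-i}(x_i), where f̂_{-i} is the estimate with the i-th instance left out; for Gaussian kernels the first term has a closed form.

import math

def normal_pdf(x, sigma):
    """Density at x of a zero-mean normal with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def lscv_score(instances, h):
    """Cross-validation score CV(h); smaller is better."""
    n = len(instances)
    # Closed form of the integral of the squared Gaussian kernel estimate.
    integral_sq = sum(normal_pdf(xi - xj, math.sqrt(2.0) * h)
                      for xi in instances for xj in instances) / (n * n)
    # Average leave-one-out density at the held-out instances.
    loo_avg = sum(
        sum(normal_pdf(xi - xj, h) for j, xj in enumerate(instances) if j != i)
        / (n - 1)
        for i, xi in enumerate(instances)
    ) / n
    return integral_sq - 2.0 * loo_avg

def select_bandwidth(instances, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    return min(candidates, key=lambda h: lscv_score(instances, h))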
In this work, we choose the commonly used Gaussian kernels. Our approach
can also be adapted to other kernel functions. We set the bandwidth of the Gaussian kernel estimator to h = 1.06 · s · n^(-1/5), as suggested in [175], where n is the cardinality of the uncertain object O and s is the standard deviation of the instances in O, which can be estimated by sampling. In Section 4.5.3, we evaluate the sensitivity of the typicality computation with respect to the bandwidth value.
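A sketch of this rule of thumb in Python (again our own code; silverman_bandwidth is a hypothetical name, and at least two instances are assumed so that s is defined):

import math

def silverman_bandwidth(instances):
    """Rule-of-thumb bandwidth h = 1.06 * s * n^(-1/5), where n is the
    number of instances and s is their sample standard deviation."""
    n = len(instances)
    mean = sum(instances) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in instances) / (n - 1))
    return 1.06 * s * n ** (-1.0 / 5.0)

Combined with the estimator sketched earlier, kernel_likelihood(o, instances, silverman_bandwidth(instances)) yields the simple typicality score of an instance o.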
 