4.1 Answering Simple Typicality Queries
Consider an uncertain object O. The simple typicality value of an instance o ∈ O is the likelihood of o given that o is a sample of X, the underlying random variable generating the samples in O (Definition 2.5).
In this section, we first discuss how to compute likelihood values. Then, we show that the complexity of answering top-k typicality queries is quadratic. Last, we present a randomized tournament approximation algorithm (RT). The approximation algorithm developed for simple typicality computation in this section can be extended to answer top-k discriminative typicality queries and top-k representative typicality queries, as will be discussed in Sections 4.3 and 4.4, respectively.
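To make the quadratic complexity concrete, the following minimal Python sketch of the exact approach scores every instance by a kernel-estimated likelihood that touches all n instances, for O(n^2) total work. It is our own illustration rather than the book's code: the name top_k_simple_typicality is hypothetical, the instances are assumed one-dimensional, and the Gaussian kernel with bandwidth h anticipates Section 4.1.1.

import math

def top_k_simple_typicality(instances, k, h):
    """Exact top-k simple typicality: score each instance by its
    kernel-estimated likelihood over all n instances (O(n) work per
    instance, hence O(n^2) in total), then keep the k best."""
    n = len(instances)

    def likelihood(o):
        # Gaussian kernel estimate of L(o | O); see Section 4.1.1.
        return sum(math.exp(-0.5 * ((o - x) / h) ** 2)
                   for x in instances) / (n * h * math.sqrt(2.0 * math.pi))

    return sorted(instances, key=likelihood, reverse=True)[:k]

For example, with instances [1.0, 1.1, 1.2, 5.0] and h = 0.5, the isolated point 5.0 receives almost no kernel weight from the others, so the top-1 answer comes from the dense cluster around 1.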
4.1.1 Likelihood Computation
For an instance o in an uncertain object O, the likelihood L(o | O) is the posterior probability of o given the instances in O, which can be computed using probability density estimation methods. There are several model estimation techniques in the literature [164], including parametric and non-parametric density estimation. Parametric
density estimation requires a certain distribution assumption, while non-parametric
estimation does not. Among the various techniques proposed for non-parametric
density estimation [165], histogram estimation [166], kernel estimation [167, 168]
and nearest neighbor estimation [169] are the most popular. In this work, we use kernel estimation, because it can estimate unknown data distributions effectively [170].
Kernel estimation is a generalization of sampling. In random sampling, each sample point carries a unit weight. However, an observation of a sample point increases the chance of observing other points nearby. Therefore, a kernel estimator distributes the weight of each point over the space around it according to a kernel function K. The commonly used kernel functions are listed in Table 4.1, where I(|u| ≤ 1) in the kernel functions denotes the value 1 when |u| ≤ 1 holds, and 0 when it does not hold.
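As a concrete illustration of the estimator, here is a minimal Python sketch (our own code; the names gaussian_kernel and kernel_likelihood are hypothetical, and the instances are assumed one-dimensional):

import math

def gaussian_kernel(u):
    """Gaussian kernel K(u) = exp(-u^2 / 2) / sqrt(2 * pi)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernel_likelihood(o, instances, h):
    """Kernel density estimate of L(o | O) with bandwidth h: every
    instance oi spreads a unit weight around itself through K, and
    the estimate averages the weight that reaches the point o."""
    n = len(instances)
    return sum(gaussian_kernel((o - oi) / h) for oi in instances) / (n * h)

Instances close to many others accumulate weight from all of them and thus receive high likelihood, which is exactly what simple typicality rewards.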
A bandwidth parameter (also known as the smoothing parameter) h is used to control how the weight is spread over the neighborhood of a sample. As shown in [171], the quality of the kernel estimation depends mostly on the bandwidth h and only lightly on the choice of the kernel K. Too small a bandwidth produces very spiky curves, while too large a bandwidth smooths out details. A class of effective methods is data-driven least-squares cross-validation algorithms [172, 173, 174, 175], which select the bandwidth value that minimizes the integrated squared error.
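A minimal sketch of least-squares cross-validation for a Gaussian kernel estimator follows (our own illustration; the cited algorithms [172, 173, 174, 175] are more refined). Up to a constant that does not depend on h, the integrated squared error can be estimated by CV(h) = ∫ f̂(x)^2 dx − (2/n) Σ_i f̂_{-i}(x_i), where f̂_{-i} is the estimate with the i-th instance left out; for Gaussian kernels the first term has a closed form.

import math

def normal_pdf(x, sigma):
    """Density at x of a zero-mean normal with standard deviation sigma."""
    return math.exp(-0.5 * (x / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

def lscv_score(instances, h):
    """Cross-validation score CV(h); smaller is better."""
    n = len(instances)
    # Closed form of the integral of the squared Gaussian kernel estimate.
    integral_sq = sum(normal_pdf(xi - xj, math.sqrt(2.0) * h)
                      for xi in instances for xj in instances) / (n * n)
    # Average leave-one-out density at the held-out instances.
    loo_avg = sum(
        sum(normal_pdf(xi - xj, h) for j, xj in enumerate(instances) if j != i)
        / (n - 1)
        for i, xi in enumerate(instances)
    ) / n
    return integral_sq - 2.0 * loo_avg

def select_bandwidth(instances, candidates):
    """Return the candidate bandwidth with the smallest CV score."""
    return min(candidates, key=lambda h: lscv_score(instances, h))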
In this work, we choose the commonly used Gaussian kernels. Our approach
can also be adapted to other kernel functions. We set the bandwidth of the Gaussian kernel estimator to h = 1.06 · s · n^(-1/5), as suggested in [175], where n is the cardinality of the uncertain object O and s is the standard deviation of the instances in O, which can be estimated by sampling. In Section 4.5.3, we evaluate the sensitivity of the typicality computation with respect to the bandwidth value.
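A sketch of this rule of thumb in Python (again our own code; silverman_bandwidth is a hypothetical name, and at least two instances are assumed so that s is defined):

import math

def silverman_bandwidth(instances):
    """Rule-of-thumb bandwidth h = 1.06 * s * n^(-1/5), where n is the
    number of instances and s is their sample standard deviation."""
    n = len(instances)
    mean = sum(instances) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in instances) / (n - 1))
    return 1.06 * s * n ** (-1.0 / 5.0)

Combined with the estimator sketched earlier, kernel_likelihood(o, instances, silverman_bandwidth(instances)) yields the simple typicality score of an instance o.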
 