Top-k Typicality Queries on Uncertain Data - Ranking Queries on Uncertain Data

Database Reference

In-Depth Information

Proof. The theorem can be proved directly using a special form of the Chernoff-

Hoeffding bound due to Angluin and Valiant [181].

Theorem 4.4 provides an upper bound of the sample size, which is independent

of the size of data sets. The larger

ε

and

δ

, the smaller the sample size. The larger

σ

, the larger the sample size.

Using the sampling method suggested by Theorem 4.4, we can have a tourna-

ment algorithm using an LT-tree of cost O

( |

|

| )

O

log

O

. The algorithm provides a

theoretical bound on the runtime.

However, the sampling method cannot be practically gainful unless on extremely

large data sets. The LT-tree already exploits the locality of instances nicely. When

the data set is not extremely large, the number of instances in the

-neighborhood

of a node is usually (substantially) smaller than the number of samples required for

high approximation quality. In our experiments, the above case is always true. Thus,

we do not include the experimental results on this sampling method.

σ

4.3 Answering Discriminative Typicality Queries

According to Definition 2.6, discriminative typicality can be calculated as follows.

Given two uncertain objects O and S , where O is the target uncertain object, for each

instance o

O , Algorithm 4.1 can be used to compute the simple typicality scores

of o in O and S , respectively. The difference between the two is the discriminative

typicality of o .

Suppose there are m instances in the target object O and n instances in S . To com-

pute the discriminative typicality score of an instance o

∈

O , we have to compute the

simple typicality scores of o in both O and S , which takes O

∈

time. Therefore,

answering top- k discriminative typicality queries using the exact algorithm takes

time O

(

n

+

m

)

(

+

))

.

The approximation methods developed for top- k simple typicality queries can

also be adopted to answer top- k discriminative typicality queries.

m

n

4.3.1 A Randomized Tournament Algorithm

Generally, the randomized tournament algorithm can be used to answer top- k dis-

criminative typicality queries if the discriminative typicality measure is applied. In

contrast to the randomized tournament algorithm for top- k simple typicality queries,

only the instances in the target object O are involved in the tournament in order to

answer top- k discriminative typicality queries. The other instances are only used to

compute the approximate discriminative typicality scores of the instances in O .

The cost of discriminative typicality computation within one group is O

m

m + n t 2

(

)

.

m + n

t

Since there are O

(

)

groups in total, the complexity of selecting the final winner

Ranking Queries on Uncertain Data

Search WWH ::

Custom Search

Home