6. ADVANCED TECHNIQUES
The following is a canonical example of a probabilistic threshold query (PTQ): “return the ids of all sensors whose temperature is in some critical range with probability greater than 0.6”.
To explain the main ideas, it suffices to consider the problem in one dimension (of course, the problem can be generalized). The input is a set of n uncertain points p_1, ..., p_n in ℝ. The query consists of a range I ⊆ ℝ and a threshold confidence value τ. In our example above, τ = 0.6, and I describes the critical range. The goal is to find all points p_j such that P[p_j ∈ I] ≥ τ. The true value of each p_i in ℝ is a continuous random variable described by a probability density function f_i : ℝ → ℝ⁺ ∪ {0}.
A common assumption is that the probability density functions f_i are specified by (small) histograms. That is, each f_i is a piecewise-uniform step function that contains a bounded number of steps.
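To make the setup concrete, here is a brute-force baseline in Python (the names HistogramPDF and ptq are ours, purely for illustration): P[p_i ∈ I] is the total mass of the histogram buckets overlapping I, and a query simply scans all n points. The indexes discussed next exist precisely to beat this linear scan.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HistogramPDF:
    """Piecewise-uniform pdf: bucket i covers [edges[i], edges[i+1])
    with constant density mass[i] / (edges[i+1] - edges[i])."""
    edges: List[float]   # s + 1 sorted bucket boundaries
    mass: List[float]    # s bucket masses, summing to 1.0

    def prob_in(self, a: float, b: float) -> float:
        """P[p in [a, b]]: sum each bucket's overlap with [a, b]."""
        total = 0.0
        for i, m in enumerate(self.mass):
            lo, hi = self.edges[i], self.edges[i + 1]
            overlap = min(hi, b) - max(lo, a)
            if overlap > 0:
                total += m * overlap / (hi - lo)
        return total

def ptq(points: List[HistogramPDF], a: float, b: float, tau: float) -> List[int]:
    """Brute-force PTQ: ids j with P[p_j in [a, b]] >= tau (O(n * s) time)."""
    return [j for j, p in enumerate(points) if p.prob_in(a, b) >= tau]

# Example: a sensor whose temperature is uniform on [20, 30).
sensor = HistogramPDF(edges=[20.0, 25.0, 30.0], mass=[0.5, 0.5])
print(ptq([sensor], 24.0, 30.0, tau=0.6))  # P = 0.6, so id 0 is reported
```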
Consider the case where τ is known before any query begins: for example, we only report events if they have confidence greater than 0.5, but we do not know the critical range I in advance. For this problem, the idea of Cheng et al. [2004] is to refine these regions with minimum bounding rectangles, similar to an R-tree. The result is an index of size O(nτ⁻¹) that can answer queries in time O(τ⁻¹ log n). Later, Agarwal et al. [2009] showed that this problem can be reduced to the segments-below-the-line problem from computational geometry: index a set of segments in ℝ² so that all segments lying below a query point can be reported quickly. If the threshold is known in advance, an optimal index can be constructed for the problem: it has size O(n) and supports querying in O(log n) time, for any fixed choice of τ. The basic idea is to break the region using hyperplanes instead of rectangles; choosing these hyperplanes, however, requires some sophistication. Agarwal et al. [2009] also consider the case where τ is not known in advance, and they obtain indexes of size O(n log² n) with O(log³ n) query time using recent advances from computational geometry.
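One way to see the geometry behind the reduction (a sketch under our own simplifications, not the actual construction of Agarwal et al. [2009]): for a piecewise-uniform pdf, the cdf F_i is monotone and piecewise linear, and P[p_i ∈ [a, b]] = F_i(b) − F_i(a) ≥ τ holds exactly when b ≥ F_i⁻¹(F_i(a) + τ). For a fixed τ, the right-hand side traces a piecewise-linear curve over a, so p_i qualifies precisely when the query point (a, b) lies on or above one of that curve's segments. The helper functions below are illustrative names of our own choosing.

```python
def cdf(edges, mass, x):
    """F(x) for a piecewise-uniform pdf (edges: s+1 sorted floats, mass: s bucket masses)."""
    total = 0.0
    for i, m in enumerate(mass):
        lo, hi = edges[i], edges[i + 1]
        if x >= hi:
            total += m
        elif x > lo:
            total += m * (x - lo) / (hi - lo)
    return total

def inv_cdf(edges, mass, q):
    """Smallest x with F(x) >= q; F is piecewise linear, so invert bucket by bucket."""
    acc = 0.0
    for i, m in enumerate(mass):
        if acc + m >= q:
            lo, hi = edges[i], edges[i + 1]
            return lo + (q - acc) / m * (hi - lo) if m > 0 else lo
        acc += m
    return edges[-1]

def qualifies(edges, mass, a, b, tau):
    """P[p in [a, b]] >= tau  <=>  b >= F^{-1}(F(a) + tau)."""
    need = cdf(edges, mass, a) + tau
    return need <= 1.0 and b >= inv_cdf(edges, mass, need)

# Same sensor as before: temperature uniform on [20, 30).
print(qualifies([20.0, 25.0, 30.0], [0.5, 0.5], 24.0, 30.0, 0.6))  # True
```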
Indexing for probabilistic categorical data has been considered as well [Kimura et al., 2010, Sarma et al., 2008a, Singh et al., 2007]. An interesting observation by Kimura et al. [2010] is that researchers have focused on secondary indexes for probabilistic data. In contrast, Kimura et al. [2010] advocate the Uncertain Primary Index approach. By making the index a primary index, they save the expensive I/O that other indexes must perform to fetch non-probabilistic attributes. Using this idea, they demonstrate order-of-magnitude gains on several real data sets. Additionally, they develop algorithms and a cost model to maintain the index in the face of updates.
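The I/O argument is the classical clustered-versus-unclustered trade-off. The back-of-the-envelope page-fetch count below is a simplification of ours, not the actual UPI cost model: a secondary index follows one pointer per match to fetch the non-probabilistic attributes, while a primary index stores the matching tuples contiguously.

```python
import math

def io_cost(n_matches: int, tuples_per_page: int, clustered: bool) -> int:
    """Page fetches to retrieve full tuples for n_matches index hits.
    Clustered (primary) index: matching tuples sit on contiguous pages.
    Unclustered (secondary) index: worst case, one random page per match."""
    if clustered:
        return math.ceil(n_matches / tuples_per_page)
    return n_matches

# 10,000 matching tuples, 100 tuples per page:
print(io_cost(10_000, 100, clustered=True))   # 100 page fetches
print(io_cost(10_000, 100, clustered=False))  # 10,000 page fetches
```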
Indexes have also been applied to probabilistic databases with more intricate correlation structure than BID, for example, pDBs specified by Markov sequences [Letchner et al., 2009] or by graphical models [Kanagal and Deshpande, 2010]. The central problem is that, in these more intricate models, determining the correlation between two tuples may be computationally expensive. For example, in a hospital-based RFID application, we may want to know the probability that a crash cart was in patient A's room at 9am and then in patient B's room at 10am. These two events are correlated, not only with each other but with all events in that hour. Naïvely, we could replay all of these events to perform inference. Instead, these approaches summarize the contributions of the events using a skip-list-like data structure.
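To see why such a summary helps, consider a Markov sequence: conditioning the state at a later time on the state at an earlier time requires multiplying the transition matrices of every step in between, which is exactly the "replay". Precomputing products over power-of-two spans, a simplified stand-in of ours for the skip-list levels (not the actual structure of Letchner et al. [2009]), reduces a query to O(log n) matrix multiplications:

```python
import numpy as np

def build_levels(mats):
    """levels[k][t] = product of the transition matrices for steps t .. t + 2**k - 1."""
    levels = [list(mats)]
    k = 1
    while (1 << k) <= len(mats):
        half = 1 << (k - 1)
        prev = levels[k - 1]
        levels.append([prev[t] @ prev[t + half]
                       for t in range(len(mats) - (1 << k) + 1)])
        k += 1
    return levels

def interval_product(levels, t1, t2):
    """Product of steps t1 .. t2-1 using O(log(t2 - t1)) multiplications."""
    result = np.eye(levels[0][0].shape[0])
    t = t1
    while t < t2:
        k = (t2 - t).bit_length() - 1   # largest k with 2**k <= t2 - t
        result = result @ levels[k][t]
        t += 1 << k
    return result

# Toy example: 8 time steps of a 2-state chain (cart in room A or room B).
M = np.array([[0.9, 0.1], [0.2, 0.8]])
levels = build_levels([M] * 8)
# Conditional distribution of the state at time 8 given the state at time 0,
# via one lookup instead of replaying all 8 steps:
assert np.allclose(interval_product(levels, 0, 8), np.linalg.matrix_power(M, 8))
```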