6. ADVANCED TECHNIQUES
The following is a canonical example of a probabilistic threshold query (PTQ): “return the ids of all sensors whose temperature is in some critical range with probability greater than 0.6”.
To explain the main ideas, it suffices to consider the problem in one dimension (of course, the problem can be generalized). The input is a set of n uncertain points p_1, ..., p_n in ℝ. The query consists of a range I ⊆ ℝ and a threshold confidence value τ. In our example above, τ = 0.6, and I describes the critical range. The goal is to find all points p_j such that P[p_j ∈ I] ≥ τ. The true value of each p_i in ℝ is a continuous random variable described by a probability density function f_i : ℝ → ℝ⁺ ∪ {0}.
A common assumption is that the probability density functions f_i are specified by (small) histograms. That is, each f_i is a piecewise-uniform step function that contains a bounded number of steps.
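To make the setup concrete, here is a brute-force baseline in Python (the names HistogramPDF and ptq are ours, purely for illustration): P[p_i ∈ I] is the total mass of the histogram buckets overlapping I, and a query simply scans all n points. The indexes discussed next exist precisely to beat this linear scan.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class HistogramPDF:
    """Piecewise-uniform pdf: bucket i covers [edges[i], edges[i+1])
    with constant density mass[i] / (edges[i+1] - edges[i])."""
    edges: List[float]   # s + 1 sorted bucket boundaries
    mass: List[float]    # s bucket masses, summing to 1.0

    def prob_in(self, a: float, b: float) -> float:
        """P[p in [a, b]]: sum each bucket's overlap with [a, b]."""
        total = 0.0
        for i, m in enumerate(self.mass):
            lo, hi = self.edges[i], self.edges[i + 1]
            overlap = min(hi, b) - max(lo, a)
            if overlap > 0:
                total += m * overlap / (hi - lo)
        return total

def ptq(points: List[HistogramPDF], a: float, b: float, tau: float) -> List[int]:
    """Brute-force PTQ: ids j with P[p_j in [a, b]] >= tau (O(n * s) time)."""
    return [j for j, p in enumerate(points) if p.prob_in(a, b) >= tau]

# Example: a sensor whose temperature is uniform on [20, 30).
sensor = HistogramPDF(edges=[20.0, 25.0, 30.0], mass=[0.5, 0.5])
print(ptq([sensor], 24.0, 30.0, tau=0.6))  # P = 0.6, so id 0 is reported
```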
Consider the case where τ is known before any query begins: for example, we only report events if they have confidence greater than 0.5, but we do not know the critical range I in advance. For this problem, the idea of Cheng et al. [2004] is to refine these regions with minimum bounding rectangles, similar to an R-tree. The result is an index of size O(nτ⁻¹) that can answer queries in time O(τ⁻¹ log n). Later, Agarwal et al. [2009] showed that this problem can be reduced to the segments-below-the-line problem from computational geometry: index a set of segments in ℝ² so that all segments lying below a query point can be reported quickly. If the threshold is known in advance, an optimal index can be constructed for the problem: it has size O(n) and supports querying in O(log n) time, for any fixed choice of τ. The basic idea is to break the region using hyperplanes instead of rectangles; choosing these hyperplanes, however, requires some sophistication. Agarwal et al. [2009] also consider the case where τ is not known in advance, and they obtain indexes of size O(n log² n) with O(log³ n) query time using recent advances from computational geometry.
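One way to see the geometry behind the reduction (a sketch under our own simplifications, not the actual construction of Agarwal et al. [2009]): for a piecewise-uniform pdf, the cdf F_i is monotone and piecewise linear, and P[p_i ∈ [a, b]] = F_i(b) − F_i(a) ≥ τ holds exactly when b ≥ F_i⁻¹(F_i(a) + τ). For a fixed τ, the right-hand side traces a piecewise-linear curve over a, so p_i qualifies precisely when the query point (a, b) lies on or above one of that curve's segments. The helper functions below are illustrative names of our own choosing.

```python
def cdf(edges, mass, x):
    """F(x) for a piecewise-uniform pdf (edges: s+1 sorted floats, mass: s bucket masses)."""
    total = 0.0
    for i, m in enumerate(mass):
        lo, hi = edges[i], edges[i + 1]
        if x >= hi:
            total += m
        elif x > lo:
            total += m * (x - lo) / (hi - lo)
    return total

def inv_cdf(edges, mass, q):
    """Smallest x with F(x) >= q; F is piecewise linear, so invert bucket by bucket."""
    acc = 0.0
    for i, m in enumerate(mass):
        if acc + m >= q:
            lo, hi = edges[i], edges[i + 1]
            return lo + (q - acc) / m * (hi - lo) if m > 0 else lo
        acc += m
    return edges[-1]

def qualifies(edges, mass, a, b, tau):
    """P[p in [a, b]] >= tau  <=>  b >= F^{-1}(F(a) + tau)."""
    need = cdf(edges, mass, a) + tau
    return need <= 1.0 and b >= inv_cdf(edges, mass, need)

# Same sensor as before: temperature uniform on [20, 30).
print(qualifies([20.0, 25.0, 30.0], [0.5, 0.5], 24.0, 30.0, 0.6))  # True
```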
Indexing for probabilistic categorical data has been considered as well [Kimura et al., 2010, Sarma et al., 2008a, Singh et al., 2007]. An interesting observation by Kimura et al. [2010] is that researchers have focused on secondary indexes for probabilistic data. In contrast, Kimura et al. [2010] advocate the Uncertain Primary Index approach. By making the index a primary index, they save the expensive I/O that other indexes must perform to fetch non-probabilistic attributes. Using this idea, they demonstrate order-of-magnitude gains on several real data sets. Additionally, they develop algorithms and a cost model to maintain the index in the face of updates.
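The I/O argument is the classical clustered-versus-unclustered trade-off. The back-of-the-envelope page-fetch count below is a simplification of ours, not the actual UPI cost model: a secondary index follows one pointer per match to fetch the non-probabilistic attributes, while a primary index stores the matching tuples contiguously.

```python
import math

def io_cost(n_matches: int, tuples_per_page: int, clustered: bool) -> int:
    """Page fetches to retrieve full tuples for n_matches index hits.
    Clustered (primary) index: matching tuples sit on contiguous pages.
    Unclustered (secondary) index: worst case, one random page per match."""
    if clustered:
        return math.ceil(n_matches / tuples_per_page)
    return n_matches

# 10,000 matching tuples, 100 tuples per page:
print(io_cost(10_000, 100, clustered=True))   # 100 page fetches
print(io_cost(10_000, 100, clustered=False))  # 10,000 page fetches
```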
Indexes have also been applied to probabilistic databases with more intricate correlation structure than BID, for example, pDBs specified by Markov sequences [Letchner et al., 2009] or by graphical models [Kanagal and Deshpande, 2010]. The central problem is that, in these more intricate models, determining the correlation between two tuples may be computationally expensive. For example, in a hospital-based RFID application, we may want to know the probability that a crash cart was in patient A's room at 9am and then in patient B's room at 10am. These two events are correlated, not only with each other but with all events in that hour. Naïvely, we could replay all of these events to perform inference. Instead, these approaches summarize the contributions of the events using a skip-list-like data structure.
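To see why such a summary helps, consider a Markov sequence: conditioning the state at a later time on the state at an earlier time requires multiplying the transition matrices of every step in between, which is exactly the "replay". Precomputing products over power-of-two spans, a simplified stand-in of ours for the skip-list levels (not the actual structure of Letchner et al. [2009]), reduces a query to O(log n) matrix multiplications:

```python
import numpy as np

def build_levels(mats):
    """levels[k][t] = product of the transition matrices for steps t .. t + 2**k - 1."""
    levels = [list(mats)]
    k = 1
    while (1 << k) <= len(mats):
        half = 1 << (k - 1)
        prev = levels[k - 1]
        levels.append([prev[t] @ prev[t + half]
                       for t in range(len(mats) - (1 << k) + 1)])
        k += 1
    return levels

def interval_product(levels, t1, t2):
    """Product of steps t1 .. t2-1 using O(log(t2 - t1)) multiplications."""
    result = np.eye(levels[0][0].shape[0])
    t = t1
    while t < t2:
        k = (t2 - t).bit_length() - 1   # largest k with 2**k <= t2 - t
        result = result @ levels[k][t]
        t += 1 << k
    return result

# Toy example: 8 time steps of a 2-state chain (cart in room A or room B).
M = np.array([[0.9, 0.1], [0.2, 0.8]])
levels = build_levels([M] * 8)
# Conditional distribution of the state at time 8 given the state at time 0,
# via one lookup instead of replaying all 8 steps:
assert np.allclose(interval_product(levels, 0, 8), np.linalg.matrix_power(M, 8))
```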