The main idea of these approaches is the following. Recall that a Markov Sequence of length
N + 1 is a sequence of random variables X(0), ..., X(N) taking values in a finite set of states
that obey the Markov Property. A consequence of this property is:

P[X(k+1) | X(1) ... X(k)] = P[X(k+1) | X(k)]

The technical goal of these indexing approaches is to compute P[X(i) = σ | X(j) = σ'],
where i, j, σ, σ' are specified as input. Retrieving this correlation information is crucial for efficient query
processing.

The first idea is that we can write the probability computation as a matrix multiplication
and then use ideas similar to repeated squaring to summarize large chunks. Let C(i) ∈ R^{N×N}
for i = 1, ..., N, where

C(i)_{σ,σ'} = P[X(i+1) = σ' | X(i) = σ]
Then, observe that the product C (i) C (i + 1 ) gives the conditional probability matrix of two-steps, that
is:
(C(i) C(i+1))_{σ,σ'} = P[X(i+2) = σ' | X(i) = σ]
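As a sketch of this matrix view, the following Python snippet (with hypothetical one-step matrices for a 2-state chain; the names and numbers are illustrative, not from the text) checks that multiplying consecutive one-step matrices marginalizes out the intermediate state:

```python
# Per-step transition matrices stored as row-stochastic nested lists,
# where C[s][t] = P[X(i+1) = t | X(i) = s].

def mat_mul(a, b):
    """Multiply two square transition matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical one-step matrices C(1) and C(2) for a 2-state chain.
C1 = [[0.9, 0.1],
      [0.4, 0.6]]
C2 = [[0.7, 0.3],
      [0.2, 0.8]]

# (C1 C2)[s][t] = P[X(3) = t | X(1) = s]: the matrix product sums over
# both possible intermediate states, exactly as in the equation above.
C12 = mat_mul(C1, C2)

# Entry [0][0] is 0.9*0.7 + 0.1*0.2 = 0.65.
assert abs(C12[0][0] - (0.9 * 0.7 + 0.1 * 0.2)) < 1e-12
# Rows of the product are still probability distributions.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in C12)
```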
Applying this idea repeatedly, we can get the conditional probability of events that are far away
in the sequence. We can then precompute all possible transition matrices, i.e., store P[X(j) | X(i)]
for every i ≤ j. This would allow O(1) querying, but at the expense of storing all O(N^2) pairs of
indexes. In some applications, this quadratic space is too high, and we are willing to sacrifice query
performance for lower space. An idea due to Letchner et al. [2009] is based on a standard data
structure, the skip list. We instead store the transition matrices P[X(2^i) | X(1)] for i = 1, ..., log N.
The storage of this approach is linear in the original data, while achieving O(log r) query time, where
r is the distance in the sequence.
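A minimal sketch of the skip-list-style idea, simplified here to the time-homogeneous case where every step uses the same matrix C (the helper names are illustrative, not from Letchner et al.): precompute C^(2^k) by repeated squaring, then answer a distance-r query by multiplying one stored matrix per set bit of r, i.e., O(log r) multiplications:

```python
def mat_mul(a, b):
    """Multiply two square transition matrices (nested lists)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def build_index(C, max_log):
    """levels[k] = C^(2^k), built by repeated squaring."""
    levels = [C]
    for _ in range(max_log):
        levels.append(mat_mul(levels[-1], levels[-1]))
    return levels

def query(levels, r):
    """P[X(i+r) | X(i)]: multiply one stored matrix per set bit of r."""
    result = identity(len(levels[0]))
    k = 0
    while r:
        if r & 1:
            result = mat_mul(result, levels[k])
        r >>= 1
        k += 1
    return result

# Hypothetical 2-state chain.
C = [[0.9, 0.1],
     [0.4, 0.6]]
levels = build_index(C, max_log=4)

# Distance-5 query (5 = 4 + 1) touches only two stored matrices;
# it must match the naive 5-fold product.
naive = C
for _ in range(4):
    naive = mat_mul(naive, C)
fast = query(levels, 5)
assert all(abs(fast[i][j] - naive[i][j]) < 1e-9
           for i in range(2) for j in range(2))
```

The general, non-homogeneous case stores products of distinct per-step matrices at power-of-two spans, but the query decomposition is the same.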
Kanagal and Deshpande [ 2010 ] extended this idea to more sophisticated graphical models.
Here, the correlations may be more general (e.g., not only temporal), and the correlation structure
is often more sophisticated than a simple linear chain as in Markov Sequences. Nevertheless, using
a similar skip-list style data structure, both approaches are able to summarize large portions of the
model without looking at them. In both cases, this results in a large improvement in performance.
It is natural to wonder if correlations that are “far apart” are worth the cost of computing
exactly. That is, in an RFID setting, to what extent does a person's location at 9am help predict
their location at 5pm? The cost of storing this correlation information is that at query time we
have to fetch a large amount of historical data. Understanding such quality-performance trade-offs
is an interesting (and open) question. A first empirical study of this question was recently done by
Letchner et al. [ 2010 ] in the context of Markovian Streams.