The main idea of these approaches is the following. Recall that a Markov Sequence of length
N + 1 is a sequence of random variables X(0), ..., X(N) taking values in a finite set of states
that obey the Markov Property. A consequence of this property is:

P[X(k+1) | X(1) ... X(k)] = P[X(k+1) | X(k)]

The technical goal of these indexing approaches is to compute P[X(i) = σ | X(j) = σ'],
where i, j, σ, σ' are specified as input. Retrieving this correlation information is crucial for efficient query
processing.

The first idea is that we can write the probability computation as a matrix multiplication
and then use ideas similar to repeated squaring to summarize large chunks. Let C(i) ∈ R^{N×N}
for i = 1, ..., N, where

C(i)_{σ,σ'} = P[X(i+1) = σ' | X(i) = σ]
Then, observe that the product C (i) C (i + 1 ) gives the conditional probability matrix of two-steps, that
is:
(C(i) C(i+1))_{σ,σ'} = P[X(i+2) = σ' | X(i) = σ]
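As a sketch of this matrix view, the following Python snippet (with hypothetical one-step matrices for a 2-state chain; the names and numbers are illustrative, not from the text) checks that multiplying consecutive one-step matrices marginalizes out the intermediate state:

```python
# Per-step transition matrices stored as row-stochastic nested lists,
# where C[s][t] = P[X(i+1) = t | X(i) = s].

def mat_mul(a, b):
    """Multiply two square transition matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

# Hypothetical one-step matrices C(1) and C(2) for a 2-state chain.
C1 = [[0.9, 0.1],
      [0.4, 0.6]]
C2 = [[0.7, 0.3],
      [0.2, 0.8]]

# (C1 C2)[s][t] = P[X(3) = t | X(1) = s]: the matrix product sums over
# both possible intermediate states, exactly as in the equation above.
C12 = mat_mul(C1, C2)

# Entry [0][0] is 0.9*0.7 + 0.1*0.2 = 0.65.
assert abs(C12[0][0] - (0.9 * 0.7 + 0.1 * 0.2)) < 1e-12
# Rows of the product are still probability distributions.
assert all(abs(sum(row) - 1.0) < 1e-12 for row in C12)
```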
Applying this idea repeatedly, we can get the conditional probability of events that are far away
in the sequence. We can then precompute all possible transition matrices, i.e., store P[X(j) | X(i)]
for every i ≤ j. This would allow O(1) querying, but at the expense of storing all O(N^2) pairs of
indexes. In some applications, this quadratic space is too high, and we are willing to sacrifice query
performance for lower space. An idea due to Letchner et al. [2009] is based on a standard data
structure, the skip list. We instead store the transition matrices P[X(2^i) | X(1)] for i = 1, ..., log N.
The storage of this approach is linear in the original data, while achieving O(log r) query time, where
r is the distance in the sequence.
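A minimal sketch of the skip-list-style idea, simplified here to the time-homogeneous case where every step uses the same matrix C (the helper names are illustrative, not from Letchner et al.): precompute C^(2^k) by repeated squaring, then answer a distance-r query by multiplying one stored matrix per set bit of r, i.e., O(log r) multiplications:

```python
def mat_mul(a, b):
    """Multiply two square transition matrices (nested lists)."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def build_index(C, max_log):
    """levels[k] = C^(2^k), built by repeated squaring."""
    levels = [C]
    for _ in range(max_log):
        levels.append(mat_mul(levels[-1], levels[-1]))
    return levels

def query(levels, r):
    """P[X(i+r) | X(i)]: multiply one stored matrix per set bit of r."""
    result = identity(len(levels[0]))
    k = 0
    while r:
        if r & 1:
            result = mat_mul(result, levels[k])
        r >>= 1
        k += 1
    return result

# Hypothetical 2-state chain.
C = [[0.9, 0.1],
     [0.4, 0.6]]
levels = build_index(C, max_log=4)

# Distance-5 query (5 = 4 + 1) touches only two stored matrices;
# it must match the naive 5-fold product.
naive = C
for _ in range(4):
    naive = mat_mul(naive, C)
fast = query(levels, 5)
assert all(abs(fast[i][j] - naive[i][j]) < 1e-9
           for i in range(2) for j in range(2))
```

The general, non-homogeneous case stores products of distinct per-step matrices at power-of-two spans, but the query decomposition is the same.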
Kanagal and Deshpande [ 2010 ] extended this idea to more sophisticated graphical models.
Here, the correlations may be more general (e.g., not only temporal), and the correlation structure
is often more sophisticated than a simple linear chain as in Markov Sequences. Nevertheless, using
a similar skip-list style data structure, both approaches are able to summarize large portions of the
model without looking at them. In both cases, this results in a large improvement in performance.
It is natural to wonder if correlations that are “far apart” are worth the cost of computing
exactly. That is, in an RFID setting, to what extent does a person's location at 9am help predict
their location at 5pm? The cost of storing this correlation information is that at query time we
have to fetch a large amount of historical data. Understanding such quality-performance trade-offs
is an interesting (and open) question. A first empirical study of this question was recently done by
Letchner et al. [ 2010 ] in the context of Markovian Streams.