critical region will never become empty. In that case, we must modify the algorithm to stop when
the critical region becomes smaller than some desired precision ε > 0.
6.1.2 RANKING THE SET Top_k
So far we have discussed how to find the set Top_k. Next, we show how to rank (order) it, and, for that, we describe two algorithms: the first is theoretically optimal, while the second performs better and is more convenient in practice. Both rely on Algorithm 4.

The first algorithm proceeds as follows. It starts with n tuples, t_1, ..., t_n, and computes the set Top_k using Algorithm 4. Next, using this set of k tuples, it runs the same algorithm again in order to retrieve Top_{k-1} within the set of tuples Top_k = {t_1, ..., t_k}. At this point, only one tuple is left out; in other words, the set difference Top_k − Top_{k-1} contains a single tuple: print this tuple as the last tuple of the ranked list, remove it from the set, decrease k by one, and repeat. The tuple that remains in the set until the end is the top-ranked tuple. Using an argument similar to Theorem 6.4, one can prove that this algorithm is optimal within a factor of 2. Notice, however, that this algorithm is impractical because it lists the tuples in reverse order, bottom-to-top.
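To make the bottom-up procedure concrete, here is a minimal Python sketch. It assumes a hypothetical helper top_k_set(candidates, j) that wraps Algorithm 4 and returns the (unordered) set of the j highest-probability tuples among the candidates; this helper and the function names are illustrative only, not part of the book's notation.

```python
def rank_bottom_up(tuples, k, top_k_set):
    """Rank the Top_k set from the bottom up (last place first).

    top_k_set(candidates, j) is a hypothetical stand-in for Algorithm 4:
    it returns the unordered set of the j highest-probability tuples.
    """
    ranked_reversed = []                  # collected from rank k up to rank 1
    current = set(top_k_set(tuples, k))   # the unordered Top_k set
    for j in range(k, 1, -1):
        smaller = set(top_k_set(current, j - 1))   # Top_{j-1} inside Top_j
        (left_out,) = current - smaller            # exactly one tuple is dropped
        ranked_reversed.append(left_out)           # it occupies rank j
        current = smaller
    ranked_reversed.append(next(iter(current)))    # the last remaining tuple is rank 1
    ranked_reversed.reverse()                      # report in decreasing order
    return ranked_reversed
```

The sketch makes the drawback visible: the ranked list can only be reported after all k − 1 inner runs have finished, because ranks are discovered from the bottom up.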
The second algorithm is more practical. It starts by running Algorithm 4 directly on the n input tuples, setting k = 1. In other words, it starts by computing Top_1, which identifies the top tuple in the entire set. Then, it removes this tuple from the set of tuples, decreases n by 1, and computes Top_1 again on the remaining n − 1 tuples. It stops when it has printed k tuples. One can prove that, in theory, this algorithm can perform arbitrarily worse than an optimal algorithm if the intervals [L_i, U_i] shrink unpredictably in an adversarial manner. In practice, however, it has almost the same performance as the previous algorithm, and it has the major advantage that the tuples are returned in decreasing order of their probabilities.
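For contrast, the top-down variant admits an equally short sketch under the same assumption: the hypothetical top_k_set(candidates, j) again stands in for Algorithm 4. Because each iteration yields the next-best tuple, results can be reported as soon as they are found.

```python
def rank_top_down(tuples, k, top_k_set):
    """Rank the top k tuples by repeatedly extracting Top_1.

    Uses the same hypothetical top_k_set(candidates, j) wrapper around
    Algorithm 4. Tuples are produced in decreasing probability order,
    so each one can be printed as soon as it is identified.
    """
    remaining = list(tuples)
    ranked = []
    for _ in range(min(k, len(remaining))):
        (best,) = top_k_set(remaining, 1)   # Top_1 of the tuples still in play
        ranked.append(best)                 # next tuple in the ranking
        remaining.remove(best)              # decrease n by one and repeat
    return ranked
```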
6.2 SEQUENTIAL PROBABILISTIC DATABASES
Sequential probabilistic databases are motivated by the observation that a large fraction of the world's raw data is sequential and low-level [IDC, 2007], such as text data, hand-written forms, audio feeds, video feeds, and data from physical sensors. While this low-level data is rich in information, it is difficult for many applications to use directly, since applications often need higher-level information, e.g., the ASCII text corresponding to the low-level image of handwriting in a form. To make higher-level information available to applications, a popular approach is to use a statistical model that infers the higher-level information from the lower-level, raw data. To achieve high quality, many of these models account for sequential dependencies: in equipment-tracking applications, knowing where a piece of equipment is at 9:00am gives us a great deal of information about where it is located at 9:01am. Sequential probabilistic models are intended to capture these types of sequential dependencies. While sequential probabilistic databases use many of the same fundamental techniques as probabilistic relational databases, we ask different types of queries on sequential probabilistic databases. Also, the probability distributions that are common in sequential probabilistic databases cannot be represented succinctly as pc-tables and require a different representation.