critical region will never become empty. In that case, we must modify the algorithm to stop when
the critical region becomes smaller than some desired precision ε > 0.
6.1.2 RANKING THE SET Top_k
So far we have discussed how to find the set Top_k. Next, we show how to rank (order) it, and, for that, we describe two algorithms: the first is theoretically optimal, while the second performs better and is more convenient in practice. Both rely on Algorithm 4.

The first algorithm proceeds as follows. It starts with n tuples, t_1, ..., t_n, and computes the set Top_k using Algorithm 4. Next, using this set of k tuples, it runs the same algorithm again in order to retrieve Top_{k-1} within the set of tuples Top_k = {t_1, ..., t_k}. At this point, only one tuple is left out; in other words, the set difference Top_k − Top_{k-1} contains a single tuple: print this tuple as the last tuple of the ranked list, remove it from the set, decrease k by one, and repeat. The tuple that remains in the set until the end is the top-ranked tuple. Using an argument similar to Theorem 6.4, one can prove that this algorithm is optimal within a factor of 2. Notice, however, that this algorithm is impractical because it lists the tuples in reverse order, bottom-to-top.
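To make the bottom-up procedure concrete, here is a minimal Python sketch. It assumes a hypothetical helper top_k_set(candidates, j) that wraps Algorithm 4 and returns the (unordered) set of the j highest-probability tuples among the candidates; this helper and the function names are illustrative only, not part of the book's notation.

```python
def rank_bottom_up(tuples, k, top_k_set):
    """Rank the Top_k set from the bottom up (last place first).

    top_k_set(candidates, j) is a hypothetical stand-in for Algorithm 4:
    it returns the unordered set of the j highest-probability tuples.
    """
    ranked_reversed = []                  # collected from rank k up to rank 1
    current = set(top_k_set(tuples, k))   # the unordered Top_k set
    for j in range(k, 1, -1):
        smaller = set(top_k_set(current, j - 1))   # Top_{j-1} inside Top_j
        (left_out,) = current - smaller            # exactly one tuple is dropped
        ranked_reversed.append(left_out)           # it occupies rank j
        current = smaller
    ranked_reversed.append(next(iter(current)))    # the last remaining tuple is rank 1
    ranked_reversed.reverse()                      # report in decreasing order
    return ranked_reversed
```

The sketch makes the drawback visible: the ranked list can only be reported after all k − 1 inner runs have finished, because ranks are discovered from the bottom up.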
The second algorithm is more practical. It starts by running Algorithm 4 directly on the n input tuples, setting k = 1. In other words, it starts by computing Top_1, which identifies the top tuple in the entire set. Then, it removes this tuple from the set of tuples, decreases n by 1, and computes Top_1 again on the remaining n − 1 tuples. It stops when it has printed k tuples. One can prove that, in theory, this algorithm can perform arbitrarily worse than an optimal algorithm if the intervals [L_i, U_i] shrink unpredictably in an adversarial manner. In practice, however, it has almost the same performance as the previous algorithm, and it has the major advantage that the tuples are returned in decreasing order of their probabilities.
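For contrast, the top-down variant admits an equally short sketch under the same assumption: the hypothetical top_k_set(candidates, j) again stands in for Algorithm 4. Because each iteration yields the next-best tuple, results can be reported as soon as they are found.

```python
def rank_top_down(tuples, k, top_k_set):
    """Rank the top k tuples by repeatedly extracting Top_1.

    Uses the same hypothetical top_k_set(candidates, j) wrapper around
    Algorithm 4. Tuples are produced in decreasing probability order,
    so each one can be printed as soon as it is identified.
    """
    remaining = list(tuples)
    ranked = []
    for _ in range(min(k, len(remaining))):
        (best,) = top_k_set(remaining, 1)   # Top_1 of the tuples still in play
        ranked.append(best)                 # next tuple in the ranking
        remaining.remove(best)              # decrease n by one and repeat
    return ranked
```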
6.2 SEQUENTIAL PROBABILISTIC DATABASES
Sequential probabilistic databases are motivated by the observation that a large fraction of the world's raw data is sequential and low-level [IDC, 2007], such as text data, hand-written forms, audio feeds, video feeds, and data from physical sensors. While this low-level data is rich in information, it is difficult for many applications to use directly, since applications often need higher-level information, e.g., the ASCII text corresponding to the low-level image of handwriting in a form. To make higher-level information available to applications, a popular approach is to use a statistical model that infers the higher-level information from the lower-level, raw data. To achieve high quality, many of these models account for sequential dependencies: in equipment-tracking applications, knowing where a piece of equipment is at 9:00am gives us a great deal of information about where it is located at 9:01am. Sequential probabilistic models are intended to capture these types of sequential dependencies. While sequential probabilistic databases use many of the same fundamental techniques as probabilistic relational databases, we ask different types of queries on sequential probabilistic databases. Also, the probability distributions that are common in sequential probabilistic databases cannot be represented succinctly as pc-tables and require a different representation.