Advanced Techniques - Probabilistic Databases

Database Reference

In-Depth Information

CHAPTER

6

Advanced Techniques

This chapter includes a selection of advanced topics that extend the model discussed in previous

chapters in several ways: top- k processing, sequential databases, Monte Carlo databases, and indexes

for speeding up query processing. This material is more advanced and treated at a higher level

(meaning with less detail) than the material in the previous chapters: the reader is referred to the

bibliographic references for more details.

We begin by discussing top- k processing, which is a practical optimization to query processing

in probabilistic databases. We then present sequential probabilistic databases that allow us to apply

probabilistic database ideas to sophisticated text processing applications or to applications that use

sensor data, audio data, or optical character recognition data. We then discuss more sophisticated

generalizations that allow us to capture deep analytic applications by us finely capturing statistical

correlations and dependencies, such as Markov Network-based Databases and Monte Carlo-based

databases. Finally, we discuss some novel techniques for indexing probabilistic databases, with ap-

plications both to sequential databases, and databases representing large graphical models.

6.1

TOP- k QUERY ANSWERING

We have discussed so far, in the previous two chapters, how to compute the output probabilities

exactly or approximately. We now discuss a way to avoid computing most of the output probabilities,

namely when the user asks for the top k highest ranked answers.

The justification for the approach discussed here is that, in many applications of probabilistic

databases, the probabilities are mere degrees of uncertainty in the data, and do not have any otherwise

semantics that is meaningful to the user. Instead, users care only about ranking the query answers and

want to retrieve only the top k answers: if the outputs are (t 1 ,p 1 ),...,(t n ,p n ) , and p 1 ≥ ... ≥ p n ,

then the system should display only the top k answers, in decreasing order of their probabilities. In

Section 1.1 , we showed a SQL query on the NELL probabilistic database: its answers are ranked 1

in decreasing order of their probabilities. The actual probabilities may, or may not be printed: the

ranking is the most important information for the user. Also, usually k n : for example, k = 10

while n can be in the thousands. It seems a waste of resources to compute all n probabilities, only to

return the top k .

When the probabilities can be computed exactly and efficiently, then one should just compute

all n output probabilities, sort them, and return the top k . We assume in this section that the

probabilities are expensive to compute exactly; instead, we approximate them, through some iterative

1 Ranking output tuples should not be confused with ranking attributes, in Subsection 4.1.2 .

Search WWH ::

Custom Search

Home