Database Reference
In-Depth Information
CHAPTER
6
Advanced Techniques
This chapter includes a selection of advanced topics that extend the model discussed in previous
chapters in several ways: top- k processing, sequential databases, Monte Carlo databases, and indexes
for speeding up query processing. This material is more advanced and treated at a higher level
(meaning with less detail) than the material in the previous chapters: the reader is referred to the
bibliographic references for more details.
We begin by discussing top- k processing, which is a practical optimization to query processing
in probabilistic databases. We then present sequential probabilistic databases that allow us to apply
probabilistic database ideas to sophisticated text processing applications or to applications that use
sensor data, audio data, or optical character recognition data. We then discuss more sophisticated
generalizations that allow us to capture deep analytic applications by us finely capturing statistical
correlations and dependencies, such as Markov Network-based Databases and Monte Carlo-based
databases. Finally, we discuss some novel techniques for indexing probabilistic databases, with ap-
plications both to sequential databases, and databases representing large graphical models.
6.1
TOP- k QUERY ANSWERING
We have discussed so far, in the previous two chapters, how to compute the output probabilities
exactly or approximately. We now discuss a way to avoid computing most of the output probabilities,
namely when the user asks for the top k highest ranked answers.
The justification for the approach discussed here is that, in many applications of probabilistic
databases, the probabilities are mere degrees of uncertainty in the data, and do not have any otherwise
semantics that is meaningful to the user. Instead, users care only about ranking the query answers and
want to retrieve only the top k answers: if the outputs are (t 1 ,p 1 ),...,(t n ,p n ) , and p 1 ... p n ,
then the system should display only the top k answers, in decreasing order of their probabilities. In
Section 1.1 , we showed a SQL query on the NELL probabilistic database: its answers are ranked 1
in decreasing order of their probabilities. The actual probabilities may, or may not be printed: the
ranking is the most important information for the user. Also, usually k n : for example, k = 10
while n can be in the thousands. It seems a waste of resources to compute all n probabilities, only to
return the top k .
When the probabilities can be computed exactly and efficiently, then one should just compute
all n output probabilities, sort them, and return the top k . We assume in this section that the
probabilities are expensive to compute exactly; instead, we approximate them, through some iterative
1 Ranking output tuples should not be confused with ranking attributes, in Subsection 4.1.2 .
Search WWH ::




Custom Search