Conclusion - Probabilistic Databases

Database Reference

In-Depth Information

Conclusion

This topic discusses the state of the art in representation formalisms and query processing

techniques for probabilistic data. Such data are produced by an increasing amount of applications,

such as information extraction, entity resolution, sensor data, financial risk assessment, or scientific

data.

We started by discussing the foundations in incomplete information and possible world seman-

tics and reviewed c-tables, a classic formalism for representing incomplete databases. We then dis-

cussed basic principles for representing large probabilistic databases, by decomposing such databases

into tuple-independent tables, block-independent-disjoint tables, or U-databases. We gave several

examples of how to achieve such a decomposition, and we proved that such a decomposition is

always possible but may incur an exponential blowup.

Then we discussed the query evaluation problem on probabilistic databases and showed that

even if the input is restricted to the simplest model of tuple-independent tables, several queries are

hard for #P.

There are two approaches for evaluating relational queries on probabilistic databases. In ex-

tensional query evaluation , the entire probabilistic inference can be pushed into the database engine

and, therefore, processed as effectively as the evaluation of standard SQL queries. Although exten-

sional evaluation is only possible on safe queries , it can be extremely effective when it works. For an

important class of relational queries, namely Unions of Conjunctive Queries, extensional evaluation

is provably complete: every query that cannot be evaluated extensionally has a data complexity that is

hard for #P. The dichotomy into polynomial time or #P-hard is based entirely on the query's syntax.

In intensional query evaluation , the probabilistic inference is performed over a propositional

formula, called lineage expression : every relational query can be evaluated this way, but the data

complexity depends dramatically on the query and the instance and can be #P-hard in general.

We discussed two approximation methods for intensional query evaluation, which can trade off

the precision of the output probability for increased performance. Intensional query evaluation can

be further refined to query compilation , which means translating the query's lineage into a decision

diagram on which the query's output probability can be computed in linear time. As for safe queries,

there exists various syntactic characterizations that ensure that the compilation into a given target

is tractable.

Finally, we discussed briefly some advanced topics: efficient ranking of the query's answers,

sequential probabilistic databases, Monte Carlo databases, indexes, and materialized views.

Search WWH ::

Custom Search

Home