Database Reference
In-Depth Information
Conclusion
This topic discusses the state of the art in representation formalisms and query processing
techniques for probabilistic data. Such data are produced by an increasing amount of applications,
such as information extraction, entity resolution, sensor data, financial risk assessment, or scientific
data.
We started by discussing the foundations in incomplete information and possible world seman-
tics and reviewed c-tables, a classic formalism for representing incomplete databases. We then dis-
cussed basic principles for representing large probabilistic databases, by decomposing such databases
into tuple-independent tables, block-independent-disjoint tables, or U-databases. We gave several
examples of how to achieve such a decomposition, and we proved that such a decomposition is
always possible but may incur an exponential blowup.
Then we discussed the query evaluation problem on probabilistic databases and showed that
even if the input is restricted to the simplest model of tuple-independent tables, several queries are
hard for #P.
There are two approaches for evaluating relational queries on probabilistic databases. In ex-
tensional query evaluation , the entire probabilistic inference can be pushed into the database engine
and, therefore, processed as effectively as the evaluation of standard SQL queries. Although exten-
sional evaluation is only possible on safe queries , it can be extremely effective when it works. For an
important class of relational queries, namely Unions of Conjunctive Queries, extensional evaluation
is provably complete: every query that cannot be evaluated extensionally has a data complexity that is
hard for #P. The dichotomy into polynomial time or #P-hard is based entirely on the query's syntax.
In intensional query evaluation , the probabilistic inference is performed over a propositional
formula, called lineage expression : every relational query can be evaluated this way, but the data
complexity depends dramatically on the query and the instance and can be #P-hard in general.
We discussed two approximation methods for intensional query evaluation, which can trade off
the precision of the output probability for increased performance. Intensional query evaluation can
be further refined to query compilation , which means translating the query's lineage into a decision
diagram on which the query's output probability can be computed in linear time. As for safe queries,
there exists various syntactic characterizations that ensure that the compilation into a given target
is tractable.
Finally, we discussed briefly some advanced topics: efficient ranking of the query's answers,
sequential probabilistic databases, Monte Carlo databases, indexes, and materialized views.
Search WWH ::




Custom Search