Database Reference
In-Depth Information
complex correlations between the random variables. Thus, the probabilistic model in databases is
simple in the sense that there are no correlations at all, or only disjoint events.
Second, the notion of a query is quite different. In GMs, the query is simple: it asks for the
probability of some output variables given some evidence; a typical query is P(X 1 X 3 | X 2 X 5 X 7 ) ,
which asks for the probability of (certain values of ) the random variables X 1 ,X 3 , given the evidence
(values for) X 2 ,X 5 ,X 7 . In probabilistic databases, the query is complex: it is an expression in the
Relational Calculus, or in SQL, as we have illustrated over the NELL database.
Third, the network in GMs depends only on the data and is independent on the query, while
in probabilistic databases the network depends on both the data and the query. Thus, the network in
GMs is static while in probabilistic databases it is dynamic. The network in probabilistic databases is
the query's lineage, obtained from both the databases instance and the query and may be both large
(because the database is large) and complex (because the query is complex). The distinction between
a static network in GM and a dynamic network in probabilistic databases affects dramatically our
approach to probabilistic inference.
The complexity of the probabilistic inference problem is measured in terms of the size of the
network (for GMs) and in the size of the database (for probabilistic databases). In this respect, the
network in GMs is analogous to the database instance in databases. However, the key parameter
influencing the complexity is different. In GM, the main complexity parameter is the network's
treewidth; all probabilistic inference algorithms for GM run in time that is exponential in the
treewidth of the network. In probabilistic databases, the main complexity parameter is the query:
we fix the query, then ask for the complexity of probabilistic inference in terms of the size of the
database instance. This is called data complexity by Va rd i [ 1982 ]. We will show that, depending on
the query, the data complexity can range from polynomial time to #P-hard.
Finally, probabilistic databases are an evolution of standard, relational database. In particular,
they must use techniques that integrate smoothly with existing query processing techniques, such
as indexes, cost-based query optimizations, the use of database statistics, and parallelization. This
requires both a conceptual approach to probabilistic inference that is consistent with standard query
evaluation and a significant engineering effort to integrate this probabilistic inference with a relational
database system. In contrast, probabilistic inference algorithms for GM are stand-alone, and they
are currently not integrated with relational query processing systems.
1.2.8 SAFE QUERIES, SAFE QUERY PLANS, AND THE DICHOTOMY
An extensional query plan is a query plan that manipulates probabilities explicitly and computes both
the answers and probabilities. Two popular examples of extended operators in extensional query
plans are the independent join operator,
i , which multiplies the probabilities of the tuples it joins,
under the assumption that they are independent, and the independent project operator, i , which
computes the probability of an output tuple t as 1
p n ) where p 1 ,...,p n are
the probabilities of all tuples that project into t , again assuming that these tuples are independent.
In general, an extensional plan does not compute the query probabilities correctly. If the plan does
( 1
p 1 ) ··· ( 1
Search WWH ::




Custom Search