Database Reference
In-Depth Information
2. DATA AND QUERY MODEL
Graphical model are extensively discussed in the influential topic by
Pearl
[
1989
] and in recent
topics by
Koller and Friedman
[
2009
] and
Darwiche
[
2009
].
Senetal.
[
2009
] investigate the use
of graphical models in probabilistic databases. The
Probabilistic Relational Model
(PRM) introduced
by
Koller
[
1999
] and
Friedman et al.
[
1999
] is, in essence, a large Bayesian Network represented
as a database where the probabilistic structure is captured at the schema level. For example, the
Bayesian network may define a probability distribution for the Disease attribute, conditioned on other
attributes such as Age, Symptoms, and Ethnicity; then, the PRM replicates this Bayesian network
once for very Patient record in a database of patients. An important contribution of PRMs is that they
allow the Bayesian Network to refer to keys and foreign keys; for example, the probability distribution
on Diseases may also depend on the diseases of a patient's friends, which are obtained by following
foreign keys. Bayesian networks were applied to optimize the representation of multidimensional
histograms by
Getoor et al.
[
2001
]. In similar spirit,
Deshpande et al.
[
2001
] describe how Markov
Networks can be used to optimize multidimensional histograms.
The fundamental connection between normalization theory and factor decomposition in
graphical models has been discussed by
Verma and Pearl
[
1988
] but, apparently, has not been ex-
plored since then. To date, there is no formal design theory for probabilistic databases; a step towards
this direction is taken by
Sarmaetal.
[
2009a
], who discuss functional dependencies for uncertain
databases.
The query semantics based on possible worlds that we introduced in
Section 2.3
is similar to
intensional semantics
discussed by
Fuhr and Rölleke
[
1997
]. While some early work on probabilistic
databases by
Lakshmanan et al.
[
1997
] and
Dey and Sarkar
[
1996
], use a simpler, less precise se-
mantics, all recent work on probabilistic databases follows the possible world semantics for query
evaluation, with the exception of the work by
Li and Deshpande
[
2009
], who propose an alternative
query semantics based on the notion of a consensus answer; this is a deterministic answer world that
minimizes the expected distance to the possible worlds (answers), and the work by
Gatterbauer et al.
[
2010
] who propose a semantics called
propagation
that can always be evaluated efficiently.
Throughout this topic, we consider only queries that are written against a deterministic
database. That is, the user writes the query with a deterministic database in mind, then the query
is evaluated against a probabilistic database, by evaluating it on every possible world, as we dis-
cussed in
Section 2.3
. This is a restriction; in practice, one would like to have a query language that
allows the user to query the probability distribution itself or to generate new uncertainty from a
certain value. Several languages have been studied that go beyond these restrictions and add consid-
erable expressive power. For example, computing conditional probabilities, maximum likelihoods,
or maximum-a-posteriori (MAP) values on a probabilistic database can be supported by query lan-
guages that support probabilistic subqueries and aggregates [
Koch
,
2008c
,
Koch and Olteanu
,
2008
].
This additional power does not necessarily come at high cost. For example, conditional probabili-
ties are simply ratios of probabilities computable using the techniques studied in the next section.
Compositional languages for probabilistic databases are supported by both the Trio system [
Widom
,
2008
] and the MayBMS system [
Antova et al.
,
2007a
,
b
,
Koch
,
2008c
]. In both cases, probabilistic