Database Reference
In-Depth Information
2. DATA AND QUERY MODEL
Graphical model are extensively discussed in the influential topic by Pearl [ 1989 ] and in recent
topics by Koller and Friedman [ 2009 ] and Darwiche [ 2009 ]. Senetal. [ 2009 ] investigate the use
of graphical models in probabilistic databases. The Probabilistic Relational Model (PRM) introduced
by Koller [ 1999 ] and Friedman et al. [ 1999 ] is, in essence, a large Bayesian Network represented
as a database where the probabilistic structure is captured at the schema level. For example, the
Bayesian network may define a probability distribution for the Disease attribute, conditioned on other
attributes such as Age, Symptoms, and Ethnicity; then, the PRM replicates this Bayesian network
once for very Patient record in a database of patients. An important contribution of PRMs is that they
allow the Bayesian Network to refer to keys and foreign keys; for example, the probability distribution
on Diseases may also depend on the diseases of a patient's friends, which are obtained by following
foreign keys. Bayesian networks were applied to optimize the representation of multidimensional
histograms by Getoor et al. [ 2001 ]. In similar spirit, Deshpande et al. [ 2001 ] describe how Markov
Networks can be used to optimize multidimensional histograms.
The fundamental connection between normalization theory and factor decomposition in
graphical models has been discussed by Verma and Pearl [ 1988 ] but, apparently, has not been ex-
plored since then. To date, there is no formal design theory for probabilistic databases; a step towards
this direction is taken by Sarmaetal. [ 2009a ], who discuss functional dependencies for uncertain
databases.
The query semantics based on possible worlds that we introduced in Section 2.3 is similar to
intensional semantics discussed by Fuhr and Rölleke [ 1997 ]. While some early work on probabilistic
databases by Lakshmanan et al. [ 1997 ] and Dey and Sarkar [ 1996 ], use a simpler, less precise se-
mantics, all recent work on probabilistic databases follows the possible world semantics for query
evaluation, with the exception of the work by Li and Deshpande [ 2009 ], who propose an alternative
query semantics based on the notion of a consensus answer; this is a deterministic answer world that
minimizes the expected distance to the possible worlds (answers), and the work by Gatterbauer et al.
[ 2010 ] who propose a semantics called propagation that can always be evaluated efficiently.
Throughout this topic, we consider only queries that are written against a deterministic
database. That is, the user writes the query with a deterministic database in mind, then the query
is evaluated against a probabilistic database, by evaluating it on every possible world, as we dis-
cussed in Section 2.3 . This is a restriction; in practice, one would like to have a query language that
allows the user to query the probability distribution itself or to generate new uncertainty from a
certain value. Several languages have been studied that go beyond these restrictions and add consid-
erable expressive power. For example, computing conditional probabilities, maximum likelihoods,
or maximum-a-posteriori (MAP) values on a probabilistic database can be supported by query lan-
guages that support probabilistic subqueries and aggregates [ Koch , 2008c , Koch and Olteanu , 2008 ].
This additional power does not necessarily come at high cost. For example, conditional probabili-
ties are simply ratios of probabilities computable using the techniques studied in the next section.
Compositional languages for probabilistic databases are supported by both the Trio system [ Widom ,
2008 ] and the MayBMS system [ Antova et al. , 2007a , b , Koch , 2008c ]. In both cases, probabilistic
Search WWH ::




Custom Search