Data andQuery Model - Probabilistic Databases

Database Reference

In-Depth Information

2. DATA AND QUERY MODEL

Graphical model are extensively discussed in the influential topic by Pearl [ 1989 ] and in recent

topics by Koller and Friedman [ 2009 ] and Darwiche [ 2009 ]. Senetal. [ 2009 ] investigate the use

of graphical models in probabilistic databases. The Probabilistic Relational Model (PRM) introduced

by Koller [ 1999 ] and Friedman et al. [ 1999 ] is, in essence, a large Bayesian Network represented

as a database where the probabilistic structure is captured at the schema level. For example, the

Bayesian network may define a probability distribution for the Disease attribute, conditioned on other

attributes such as Age, Symptoms, and Ethnicity; then, the PRM replicates this Bayesian network

once for very Patient record in a database of patients. An important contribution of PRMs is that they

allow the Bayesian Network to refer to keys and foreign keys; for example, the probability distribution

on Diseases may also depend on the diseases of a patient's friends, which are obtained by following

foreign keys. Bayesian networks were applied to optimize the representation of multidimensional

histograms by Getoor et al. [ 2001 ]. In similar spirit, Deshpande et al. [ 2001 ] describe how Markov

Networks can be used to optimize multidimensional histograms.

The fundamental connection between normalization theory and factor decomposition in

graphical models has been discussed by Verma and Pearl [ 1988 ] but, apparently, has not been ex-

plored since then. To date, there is no formal design theory for probabilistic databases; a step towards

this direction is taken by Sarmaetal. [ 2009a ], who discuss functional dependencies for uncertain

databases.

The query semantics based on possible worlds that we introduced in Section 2.3 is similar to

intensional semantics discussed by Fuhr and Rölleke [ 1997 ]. While some early work on probabilistic

databases by Lakshmanan et al. [ 1997 ] and Dey and Sarkar [ 1996 ], use a simpler, less precise se-

mantics, all recent work on probabilistic databases follows the possible world semantics for query

evaluation, with the exception of the work by Li and Deshpande [ 2009 ], who propose an alternative

query semantics based on the notion of a consensus answer; this is a deterministic answer world that

minimizes the expected distance to the possible worlds (answers), and the work by Gatterbauer et al.

[ 2010 ] who propose a semantics called propagation that can always be evaluated efficiently.

Throughout this topic, we consider only queries that are written against a deterministic

database. That is, the user writes the query with a deterministic database in mind, then the query

is evaluated against a probabilistic database, by evaluating it on every possible world, as we dis-

cussed in Section 2.3 . This is a restriction; in practice, one would like to have a query language that

allows the user to query the probability distribution itself or to generate new uncertainty from a

certain value. Several languages have been studied that go beyond these restrictions and add consid-

erable expressive power. For example, computing conditional probabilities, maximum likelihoods,

or maximum-a-posteriori (MAP) values on a probabilistic database can be supported by query lan-

guages that support probabilistic subqueries and aggregates [ Koch , 2008c , Koch and Olteanu , 2008 ].

This additional power does not necessarily come at high cost. For example, conditional probabili-

ties are simply ratios of probabilities computable using the techniques studied in the next section.

Compositional languages for probabilistic databases are supported by both the Trio system [ Widom ,

2008 ] and the MayBMS system [ Antova et al. , 2007a , b , Koch , 2008c ]. In both cases, probabilistic

Search WWH ::

Custom Search

Home