Data andQuery Model - Probabilistic Databases

Database Reference

In-Depth Information

2.8

BIBLIOGRAPHIC AND HISTORICAL NOTES

The seminal work on incomplete information by Imieli nski and Lipski [ 1984 ] introduced three

kinds of representation systems. Codd-tables are tables with nulls; v-tables are tables that may

contain variables, also called marked nulls 6 , in addition to constants; c-tables are v-tables where each

tuple is annotated with a propositional formula. In addition, a c-table may specify a global condition

that restricts the set of possible worlds to those defined by total valuations that satisfy this global

condition. The c-tables in this topic are restricted to contain only constants (no variables), and we

also dropped the global condition.

The early probabilistic data model introduced by [ Barbará et al. , 1992 ] is essentially a BID data

model. While the authors had been inspired by earlier work on incomplete information and c-tables,

a formal connection was not established until very recently. Green and Tannen [ 2006 ] provide a rig-

orous discussion of the relationships between incomplete databases and probabilistic databases, and

they introduced the term pc-table , which we also used in this topic. Several researchers have proposed

representation systems for probabilistic databases. The Trio system, discussed by Benjelloun et al.

[ 2006a , b ], Sarmaetal. [ 2006 , 2009b ], Widom [ 2005 ], designs a model for incomplete and proba-

bilistic databases based on maybe-tuples, X-tuples, and lineage expressions, searching for the right

balance of expressiveness and simplicity. Tuple-independent probabilistic databases are discussed

by Dalvi and Suciu [ 2004 ], motivated by queries with approximate predicates, which introduce an

independent event for every potential match. Disjoint-independent probabilistic databases are dis-

cussed by Andritsos et al. [ 2006 ], Dalvi and Suciu [ 2007c ]. Poole [ 1993 ] states that an arbitrary

probability space can be represented by composing primitive independent-disjoint events, which

captures the essence of Proposition 2.18 ; the proposition in the form presented here is mentioned in

Dalvi and Suciu [ 2007c ]. Proposition 2.16 and Proposition 2.17 seem folklore, but we were not able

to trace them to any specific reference. U-relations have been originally introduced in the context of

the MayBMS project by Antova et al. [ 2008 ].

Several other representation formalisms for probabilistic databases are discussed in the litera-

ture. World-set decompositions are a complete representation formalism for uncertain and probabilistic

data [ Antova et al. , 2007c , 2009 , Olteanu et al. , 2008 ]. The decomposition used by this formalism

is a prime factorization of a universal relation representation [ Ullman , 1990 ] of the set of possi-

ble worlds representing the probability space. In their probabilistic form, such decompositions can

be thought of as shallow Bayesian Networks. Li and Deshpande [ 2009 ] define the And/Xor Tree

Model , which generalizes BID tables by allowing arbitrary interleaving of and (independence) and

xor (disjointness) relationships between the tuples. The And/Xor Tree model is a special case of

WS-trees introduced by Koch and Olteanu [ 2008 ] and, subsequently, generalized by decomposition

trees [ Olteanu et al. , 2010 ]. Several of the data models for probabilistic databases are surveyed in

the first edited collection of articles on the topic [ Aggarwal , 2008 ]. More recent work considered

extensions of c-tables with continuous probability distributions [ Kennedy and Koch , 2010 ].

6 In a Codd-table, all nulls are distinct variables; in other words, a Codd-table cannot assert that two values are equal.

Search WWH ::

Custom Search

Home