Database Reference
In-Depth Information
2.8
BIBLIOGRAPHIC AND HISTORICAL NOTES
The seminal work on incomplete information by
Imieli nski and Lipski
[
1984
] introduced three
kinds of representation systems. Codd-tables are tables with nulls; v-tables are tables that may
contain variables, also called
marked nulls
6
, in addition to constants; c-tables are v-tables where each
tuple is annotated with a propositional formula. In addition, a c-table may specify a global condition
that restricts the set of possible worlds to those defined by total valuations that satisfy this global
condition. The c-tables in this topic are restricted to contain only constants (no variables), and we
also dropped the global condition.
The early probabilistic data model introduced by [
Barbará et al.
,
1992
] is essentially a BID data
model. While the authors had been inspired by earlier work on incomplete information and c-tables,
a formal connection was not established until very recently.
Green and Tannen
[
2006
] provide a rig-
orous discussion of the relationships between incomplete databases and probabilistic databases, and
they introduced the term
pc-table
, which we also used in this topic. Several researchers have proposed
representation systems for probabilistic databases. The Trio system, discussed by
Benjelloun et al.
[
2006a
,
b
],
Sarmaetal.
[
2006
,
2009b
],
Widom
[
2005
], designs a model for incomplete and proba-
bilistic databases based on maybe-tuples, X-tuples, and lineage expressions, searching for the right
balance of expressiveness and simplicity. Tuple-independent probabilistic databases are discussed
by
Dalvi and Suciu
[
2004
], motivated by queries with approximate predicates, which introduce an
independent event for every potential match. Disjoint-independent probabilistic databases are dis-
cussed by
Andritsos et al.
[
2006
],
Dalvi and Suciu
[
2007c
].
Poole
[
1993
] states that an arbitrary
probability space can be represented by composing primitive independent-disjoint events, which
captures the essence of
Proposition 2.18
; the proposition in the form presented here is mentioned in
Dalvi and Suciu
[
2007c
].
Proposition 2.16
and
Proposition 2.17
seem folklore, but we were not able
to trace them to any specific reference. U-relations have been originally introduced in the context of
the MayBMS project by
Antova et al.
[
2008
].
Several other representation formalisms for probabilistic databases are discussed in the litera-
ture.
World-set decompositions
are a complete representation formalism for uncertain and probabilistic
data [
Antova et al.
,
2007c
,
2009
,
Olteanu et al.
,
2008
]. The decomposition used by this formalism
is a prime factorization of a universal relation representation [
Ullman
,
1990
] of the set of possi-
ble worlds representing the probability space. In their probabilistic form, such decompositions can
be thought of as shallow Bayesian Networks.
Li and Deshpande
[
2009
] define the
And/Xor Tree
Model
, which generalizes BID tables by allowing arbitrary interleaving of
and
(independence) and
xor
(disjointness) relationships between the tuples. The And/Xor Tree model is a special case of
WS-trees
introduced by
Koch and Olteanu
[
2008
] and, subsequently, generalized by
decomposition
trees
[
Olteanu et al.
,
2010
]. Several of the data models for probabilistic databases are surveyed in
the first edited collection of articles on the topic [
Aggarwal
,
2008
]. More recent work considered
extensions of c-tables with continuous probability distributions [
Kennedy and Koch
,
2010
].
6
In a Codd-table, all nulls are distinct variables; in other words, a Codd-table cannot assert that two values are equal.