Database Reference
In-Depth Information
1. OVERVIEW
is only the semantics: in practice we use much more compact ways to represent the probabilistic
database, as we discuss in
Chapter 2
.
1.2.3 TYPES OF UNCERTAINTY
Two types of uncertainty are used in probabilistic databases: tuple-level uncertainty and attribute-
level uncertainty.
In
tuple-level uncertainty
, a tuple is a random variable; we do not know whether the tuple
belongs to the database instance or not. The random variable associated to the tuple has a Boolean
domain: it is
true
when the tuple is present and
false
if it is absent. Such a tuple is also called a
maybe tuple
[
Widom
,
2008
]. In
attribute-level uncertainty
, the value of an attribute
A
is uncertain:
for each tuple, the attribute
A
represents a random variable, and its domain is the set of values that
the attribute may take for that tuple.
We will find it convenient to convert attribute-level uncertainty into tuple-level uncertainty
and consider only tuple-level uncertainty during query processing. This translation is done as follows.
For every tuple
t
, where the attribute
A
takes possible values
a
1
,a
2
,a
3
,...
, we create several clone
tuples
t
1
,t
2
,t
3
,...
that are identical to
t
except for the attribute
A
, whose values are
t
1
.A
=
a
1
,
t
2
.A
=
a
2
, etc. Now each tuple
t
i
is uncertain and described by a random variable, and the tuples
t
1
,t
2
,...
are mutually exclusive. A block of exclusive tuples is also called an
X-tuple
[
Widom
,
2008
].
1.2.4 TYPES OF PROBABILISTIC DATABASES
The simplest probabilistic database is a
tuple-independent
database, where the tuples are independent
probabilistic events. Another popular kind is the
block independent-disjoint
probabilistic database, or
BID, where the tuples are partitioned into blocks, such that all tuples within a block are disjoint (i.e.,
mutually exclusive) events, and all tuples from different blocks are independent events. Attribute level
uncertainty can be naturally represented as a BID table. While sometimes one needs to represent
more complex correlations between the tuples in a database, this is usually achieved by decomposing
the database into independent and disjoint components, in a process much like traditional database
normalization. Another classification of probabilistic databases is into
discrete
and
continuous
. In the
former, attributes are discrete random variables; in the latter, they are continuous random variables.
In this topic, we focus on discrete probabilistic databases and discuss the continuous case in a chapter
on advanced techniques (
Section 6.3
).
1.2.5 QUERY SEMANTICS
Recall that the answer of a query
Q
on a
deterministic
database
D
is a set of tuples, denoted
Q(D)
.
The semantics of the query on a
probabilistic
database is a set of pairs
(t, p)
, where
t
is a possible tuple,
i.e., it is in the query's answer in one of the possible worlds
W
, and
p
is the probability that
t
is in
Q(W )
when
W
is chosen randomly from the set of possible worlds. In other words,
p
represents the
marginal probability of the event “the query returns the answer
t
” over the space of possible worlds;
p
is sometimes called the
marginal probability
of the tuple
t
. In practice,
Q
returns an ordered set