Database Reference
In-Depth Information
1. OVERVIEW
is only the semantics: in practice we use much more compact ways to represent the probabilistic
database, as we discuss in Chapter 2 .
1.2.3 TYPES OF UNCERTAINTY
Two types of uncertainty are used in probabilistic databases: tuple-level uncertainty and attribute-
level uncertainty.
In tuple-level uncertainty , a tuple is a random variable; we do not know whether the tuple
belongs to the database instance or not. The random variable associated to the tuple has a Boolean
domain: it is true when the tuple is present and false if it is absent. Such a tuple is also called a
maybe tuple [ Widom , 2008 ]. In attribute-level uncertainty , the value of an attribute A is uncertain:
for each tuple, the attribute A represents a random variable, and its domain is the set of values that
the attribute may take for that tuple.
We will find it convenient to convert attribute-level uncertainty into tuple-level uncertainty
and consider only tuple-level uncertainty during query processing. This translation is done as follows.
For every tuple t , where the attribute A takes possible values a 1 ,a 2 ,a 3 ,... , we create several clone
tuples t 1 ,t 2 ,t 3 ,... that are identical to t except for the attribute A , whose values are t 1 .A = a 1 ,
t 2 .A = a 2 , etc. Now each tuple t i is uncertain and described by a random variable, and the tuples
t 1 ,t 2 ,... are mutually exclusive. A block of exclusive tuples is also called an X-tuple [ Widom , 2008 ].
1.2.4 TYPES OF PROBABILISTIC DATABASES
The simplest probabilistic database is a tuple-independent database, where the tuples are independent
probabilistic events. Another popular kind is the block independent-disjoint probabilistic database, or
BID, where the tuples are partitioned into blocks, such that all tuples within a block are disjoint (i.e.,
mutually exclusive) events, and all tuples from different blocks are independent events. Attribute level
uncertainty can be naturally represented as a BID table. While sometimes one needs to represent
more complex correlations between the tuples in a database, this is usually achieved by decomposing
the database into independent and disjoint components, in a process much like traditional database
normalization. Another classification of probabilistic databases is into discrete and continuous . In the
former, attributes are discrete random variables; in the latter, they are continuous random variables.
In this topic, we focus on discrete probabilistic databases and discuss the continuous case in a chapter
on advanced techniques ( Section 6.3 ).
1.2.5 QUERY SEMANTICS
Recall that the answer of a query Q on a deterministic database D is a set of tuples, denoted Q(D) .
The semantics of the query on a probabilistic database is a set of pairs (t, p) , where t is a possible tuple,
i.e., it is in the query's answer in one of the possible worlds W , and p is the probability that t is in
Q(W ) when W is chosen randomly from the set of possible worlds. In other words, p represents the
marginal probability of the event “the query returns the answer t ” over the space of possible worlds;
p is sometimes called the marginal probability of the tuple t . In practice, Q returns an ordered set
Search WWH ::




Custom Search