Overview - Probabilistic Databases

Database Reference

In-Depth Information

1. OVERVIEW

is only the semantics: in practice we use much more compact ways to represent the probabilistic

database, as we discuss in Chapter 2 .

1.2.3 TYPES OF UNCERTAINTY

Two types of uncertainty are used in probabilistic databases: tuple-level uncertainty and attribute-

level uncertainty.

In tuple-level uncertainty , a tuple is a random variable; we do not know whether the tuple

belongs to the database instance or not. The random variable associated to the tuple has a Boolean

domain: it is true when the tuple is present and false if it is absent. Such a tuple is also called a

maybe tuple [ Widom , 2008 ]. In attribute-level uncertainty , the value of an attribute A is uncertain:

for each tuple, the attribute A represents a random variable, and its domain is the set of values that

the attribute may take for that tuple.

We will find it convenient to convert attribute-level uncertainty into tuple-level uncertainty

and consider only tuple-level uncertainty during query processing. This translation is done as follows.

For every tuple t , where the attribute A takes possible values a 1 ,a 2 ,a 3 ,... , we create several clone

tuples t 1 ,t 2 ,t 3 ,... that are identical to t except for the attribute A , whose values are t 1 .A = a 1 ,

t 2 .A = a 2 , etc. Now each tuple t i is uncertain and described by a random variable, and the tuples

t 1 ,t 2 ,... are mutually exclusive. A block of exclusive tuples is also called an X-tuple [ Widom , 2008 ].

1.2.4 TYPES OF PROBABILISTIC DATABASES

The simplest probabilistic database is a tuple-independent database, where the tuples are independent

probabilistic events. Another popular kind is the block independent-disjoint probabilistic database, or

BID, where the tuples are partitioned into blocks, such that all tuples within a block are disjoint (i.e.,

mutually exclusive) events, and all tuples from different blocks are independent events. Attribute level

uncertainty can be naturally represented as a BID table. While sometimes one needs to represent

more complex correlations between the tuples in a database, this is usually achieved by decomposing

the database into independent and disjoint components, in a process much like traditional database

normalization. Another classification of probabilistic databases is into discrete and continuous . In the

former, attributes are discrete random variables; in the latter, they are continuous random variables.

In this topic, we focus on discrete probabilistic databases and discuss the continuous case in a chapter

on advanced techniques ( Section 6.3 ).

1.2.5 QUERY SEMANTICS

Recall that the answer of a query Q on a deterministic database D is a set of tuples, denoted Q(D) .

The semantics of the query on a probabilistic database is a set of pairs (t, p) , where t is a possible tuple,

i.e., it is in the query's answer in one of the possible worlds W , and p is the probability that t is in

Q(W ) when W is chosen randomly from the set of possible worlds. In other words, p represents the

marginal probability of the event “the query returns the answer t ” over the space of possible worlds;

p is sometimes called the marginal probability of the tuple t . In practice, Q returns an ordered set

Search WWH ::

Custom Search

Home