Data andQuery Model - Probabilistic Databases

Database Reference

In-Depth Information

2. DATA AND QUERY MODEL

R

FID

SSN

N

M

351

null

Smith

null

352

null

Brown

null

Using nulls, information about the values considered possible for the various fields is lost.

Moreover, it is not possible to express correlations such as that while social security numbers may be

uncertain, no two distinct individuals can have the same. In this example, we want to exclude the case

that both Smith and Brown have social security number 185. Finally, we cannot store probabilities

for the various alternative possible worlds.

An alternative approach is to explicitly store all the possible readings, one relation instance

per reading. The most striking problem of this approach is the potentially large number of readings.

If we conduct a survey of 50 questions on a population of 200 million and we assume that one in

10 4 answers can be read in just two different ways, we get 2 10 6 possible readings. We cannot store

all these readings explicitly; instead, we need to search for a more compact representation.

Example 2.1 Similar to the Google Squared representation in Chapter 1, we can represent the

available information by inlining within each field all of its possibilities (here, without probabilities).

R

FID

SSN

N

M

351

{ 185, 785 }

Smith

{ 1, 2 }

352

{ 185, 186 }

Brown

{ 1, 2, 3, 4 }

This representation is more compact, yet it cannot account for correlations across possible

readings of different fields, such as when we know that no two persons can have the same social

security number.

In this chapter, we introduce formalisms that are able to compactly represent uncertain data,

and start by defining the semantics of a probabilistic database. Throughout our discussion, we will

consider incomplete databases , which allow for multiple different states of the database, and probabilistic

databases , which specify in addition a probability distribution on those states.

Informally, our model is the following: fix a relational database schema. An incomplete database

is a finite set of database instances of that schema (called possible worlds). A probabilistic database

is also a finite set of possible worlds, where each world has a weight (called probability) between 0

and 1 and the weights of all worlds sum up to 1. In a subjectivist Bayesian interpretation, one of the

possible worlds is “true”, but we do not know which one, and the probabilities represent degrees of

belief in the various possible worlds. In our census scenario, the probabilistic database consists of

one world for each possible reading, which is weighted according to its likelihood.

Definition 2.2 Fix a relational schema with k relation names, R 1 ,...,R k .An incomplete database

is a finite set of structures W

={ W 1 ,W 2 ,...,W n

, where each W i

is a database instance, W i

}

=

R 1 ,...,R k

, called a possible world .

Probabilistic Databases

Search WWH ::

Custom Search

Home