Database Reference
In-Depth Information
2. DATA AND QUERY MODEL
R
FID
SSN
N
M
351
null
Smith
null
352
null
Brown
null
Using nulls, information about the values considered possible for the various fields is lost.
Moreover, it is not possible to express correlations such as that while social security numbers may be
uncertain, no two distinct individuals can have the same. In this example, we want to exclude the case
that both Smith and Brown have social security number 185. Finally, we cannot store probabilities
for the various alternative possible worlds.
An alternative approach is to explicitly store all the possible readings, one relation instance
per reading. The most striking problem of this approach is the potentially large number of readings.
If we conduct a survey of 50 questions on a population of 200 million and we assume that one in
10 4 answers can be read in just two different ways, we get 2 10 6 possible readings. We cannot store
all these readings explicitly; instead, we need to search for a more compact representation.
Example 2.1 Similar to the Google Squared representation in Chapter 1, we can represent the
available information by inlining within each field all of its possibilities (here, without probabilities).
R
FID
SSN
N
M
351
{ 185, 785 }
Smith
{ 1, 2 }
352
{ 185, 186 }
Brown
{ 1, 2, 3, 4 }
This representation is more compact, yet it cannot account for correlations across possible
readings of different fields, such as when we know that no two persons can have the same social
security number.
In this chapter, we introduce formalisms that are able to compactly represent uncertain data,
and start by defining the semantics of a probabilistic database. Throughout our discussion, we will
consider incomplete databases , which allow for multiple different states of the database, and probabilistic
databases , which specify in addition a probability distribution on those states.
Informally, our model is the following: fix a relational database schema. An incomplete database
is a finite set of database instances of that schema (called possible worlds). A probabilistic database
is also a finite set of possible worlds, where each world has a weight (called probability) between 0
and 1 and the weights of all worlds sum up to 1. In a subjectivist Bayesian interpretation, one of the
possible worlds is “true”, but we do not know which one, and the probabilities represent degrees of
belief in the various possible worlds. In our census scenario, the probabilistic database consists of
one world for each possible reading, which is weighted according to its likelihood.
Definition 2.2 Fix a relational schema with k relation names, R 1 ,...,R k .An incomplete database
is a finite set of structures W
={ W 1 ,W 2 ,...,W n
, where each W i
is a database instance, W i
}
=
R 1 ,...,R k
, called a possible world .
 
Search WWH ::




Custom Search