Database Reference
In-Depth Information
of pairs
(t
1
,p
1
), (t
2
,p
2
),...
where
t
1
,t
2
,...
are distinct tuples and
p
1
,p
2
,...
are their marginal
probabilities, such that the answers are ranked by
p
1
≥
p
2
≥
...
This semantics does not report how distinct tuples are correlated. Thus, we know that
t
1
is
an answer with probability
p
1
and that
t
2
is an answer with probability
p
2
, but we do not know the
probability that
both t
1
and
t
2
are answers. This probability can be 0 if
t
1
,t
2
are mutually exclusive;
it can be
p
1
p
2
if
t
1
and
t
2
are independent events; or it can be min
(p
1
,p
2
)
if the set of worlds
where one of the tuples is an answer is contained in the set of worlds where the other tuple is an
answer. The probability that both
t
1
,t
2
occur in the answer minus
p
1
p
2
is called the
covariance
4
of
t
1
,t
2
: the tuples are independent iff the covariance is 0. Probabilistic database systems prefer to
drop any correlation information from the query's output because it is difficult to represent for a
large number of tuples. Users, however, can still inquire about the correlation, by asking explicitly
for the probability of both
t
1
and
t
2
. For example, consider the earlier query
Retrieve all products
manufactured by a company headquartered in
San Jose
, and suppose we want to know the probability
that both
adobe_indesign
and
adobe_dreamweaver
are in the answer. We can compute that by
running a second query,
Retrieve all pairs of products, each manufactured by a company headquartered in
San Jose
, and looking up the probability of the pair
adobe_indesign
,
adobe_dreamweaver
in
the answer. Thus, while one single query does not convey information about the correlations between
the tuples in its answer, this information can always be obtained later, by asking additional, more
complex queries on the probabilistic database.
1.2.6 LINEAGE
The
lineage
of a possible output tuple to a query is a propositional formula over the input tuples in
the database, which says which input tuples must be present in order for the query to return that
output. Consider again the query “Retrieve all products manufactured by a company headquartered
in
San Jose
” on the database in
Figure 1.2
. The output tuple (
personal_computer
,
ibm
) has
lineage expression
X
1
∧
Y
1
, where
X
1
and
Y
1
represent the two input tuples shown in
Figure 1.2
;
this is because both
X
1
and
Y
1
must be in the database to ensure that output. For another example,
consider the query
find all cities that are headquarters of some companies
: the answer
san_jose
has
lineage
Y
1
∨
Y
2
because any of
Y
1
or
Y
2
are sufficient to produce the answer
san_jose
. Query
evaluation on probabilistic databases essentially reduces to the problem of computing the probability
of propositional formulas, representing lineage expressions. We discuss this in detail in
Chapter 2
.
We note that the term “lineage” is sometimes used in the literature with slightly different and
not always consistent meanings. In this topic, we will use the term lineage to denote a propositional
formula. It corresponds to the
PosBool
provenance semiring of
Green et al.
[
2007
], which is the
semiring of positive Boolean expressions, except that we also allow negation.
4
The
covariance
of
two
random
variables
X
1
,X
2
is
cov(X
1
,X
2
)
=
E
[
(X
1
−
μ
X
1
)
·
(X
2
−
μ
X
2
)
]
. In
our
case,
X
i
∈
{
0
,
1
}
is the random variable representing the absence/presence of the tuple
t
i
,
μ
X
i
=
E
[
X
i
]=
p
i
; hence, the co-
variance is
E
[
(X
1
−
μ
X
1
)
·
(X
2
−
μ
X
2
)
]=
E
[
X
1
·
X
2
]−
E
[
X
1
]·
p
2
−
p
1
·
E
[
X
2
]+
p
1
·
p
2
=
E
[
X
1
·
X
2
]−
p
1
·
p
2
−
p
1
·
p
2
+
p
1
·
p
2
=
E
[
X
1
·
X
2
]−
p
1
·
p
2
=
P(t
1
∈
Q(W )
∧
t
2
∈
Q(W ))
−
P(t
1
∈
Q(W ))
·
P(t
2
∈
Q(W ))
.