Database Reference
In-Depth Information
of pairs (t 1 ,p 1 ), (t 2 ,p 2 ),... where t 1 ,t 2 ,... are distinct tuples and p 1 ,p 2 ,... are their marginal
probabilities, such that the answers are ranked by p 1 p 2 ...
This semantics does not report how distinct tuples are correlated. Thus, we know that t 1 is
an answer with probability p 1 and that t 2 is an answer with probability p 2 , but we do not know the
probability that both t 1 and t 2 are answers. This probability can be 0 if t 1 ,t 2 are mutually exclusive;
it can be p 1 p 2 if t 1 and t 2 are independent events; or it can be min (p 1 ,p 2 ) if the set of worlds
where one of the tuples is an answer is contained in the set of worlds where the other tuple is an
answer. The probability that both t 1 ,t 2 occur in the answer minus p 1 p 2 is called the covariance 4
of t 1 ,t 2 : the tuples are independent iff the covariance is 0. Probabilistic database systems prefer to
drop any correlation information from the query's output because it is difficult to represent for a
large number of tuples. Users, however, can still inquire about the correlation, by asking explicitly
for the probability of both t 1 and t 2 . For example, consider the earlier query Retrieve all products
manufactured by a company headquartered in San Jose , and suppose we want to know the probability
that both adobe_indesign and adobe_dreamweaver are in the answer. We can compute that by
running a second query, Retrieve all pairs of products, each manufactured by a company headquartered in
San Jose , and looking up the probability of the pair adobe_indesign , adobe_dreamweaver in
the answer. Thus, while one single query does not convey information about the correlations between
the tuples in its answer, this information can always be obtained later, by asking additional, more
complex queries on the probabilistic database.
1.2.6 LINEAGE
The lineage of a possible output tuple to a query is a propositional formula over the input tuples in
the database, which says which input tuples must be present in order for the query to return that
output. Consider again the query “Retrieve all products manufactured by a company headquartered
in San Jose ” on the database in Figure 1.2 . The output tuple ( personal_computer , ibm ) has
lineage expression X 1 Y 1 , where X 1 and Y 1 represent the two input tuples shown in Figure 1.2 ;
this is because both X 1 and Y 1 must be in the database to ensure that output. For another example,
consider the query find all cities that are headquarters of some companies : the answer san_jose has
lineage Y 1 Y 2 because any of Y 1 or Y 2 are sufficient to produce the answer san_jose . Query
evaluation on probabilistic databases essentially reduces to the problem of computing the probability
of propositional formulas, representing lineage expressions. We discuss this in detail in Chapter 2 .
We note that the term “lineage” is sometimes used in the literature with slightly different and
not always consistent meanings. In this topic, we will use the term lineage to denote a propositional
formula. It corresponds to the PosBool provenance semiring of Green et al. [ 2007 ], which is the
semiring of positive Boolean expressions, except that we also allow negation.
4 The
covariance
of
two
random
variables X 1 ,X 2
is cov(X 1 ,X 2 ) = E [ (X 1 μ X 1 ) · (X 2 μ X 2 ) ]
. In
our
case, X i
{ 0 , 1 }
is the random variable representing the absence/presence of the tuple t i , μ X i = E [ X i ]= p i ; hence, the co-
variance is E [ (X 1 μ X 1 ) · (X 2 μ X 2 ) ]= E [ X 1 · X 2 ]− E [ X 1 p 2 p 1 · E [ X 2 ]+ p 1 · p 2 = E [ X 1 · X 2 ]− p 1 · p 2 p 1 ·
p 2 + p 1 · p 2 = E [ X 1 · X 2 ]− p 1 · p 2 = P(t 1 Q(W ) t 2 Q(W )) P(t 1 Q(W )) · P(t 2 Q(W )) .
Search WWH ::




Custom Search