Overview - Probabilistic Databases

Database Reference

In-Depth Information

of pairs (t 1 ,p 1 ), (t 2 ,p 2 ),... where t 1 ,t 2 ,... are distinct tuples and p 1 ,p 2 ,... are their marginal

probabilities, such that the answers are ranked by p 1 ≥ p 2 ≥ ...

This semantics does not report how distinct tuples are correlated. Thus, we know that t 1 is

an answer with probability p 1 and that t 2 is an answer with probability p 2 , but we do not know the

probability that both t 1 and t 2 are answers. This probability can be 0 if t 1 ,t 2 are mutually exclusive;

it can be p 1 p 2 if t 1 and t 2 are independent events; or it can be min (p 1 ,p 2 ) if the set of worlds

where one of the tuples is an answer is contained in the set of worlds where the other tuple is an

answer. The probability that both t 1 ,t 2 occur in the answer minus p 1 p 2 is called the covariance 4

of t 1 ,t 2 : the tuples are independent iff the covariance is 0. Probabilistic database systems prefer to

drop any correlation information from the query's output because it is difficult to represent for a

large number of tuples. Users, however, can still inquire about the correlation, by asking explicitly

for the probability of both t 1 and t 2 . For example, consider the earlier query Retrieve all products

manufactured by a company headquartered in San Jose , and suppose we want to know the probability

that both adobe_indesign and adobe_dreamweaver are in the answer. We can compute that by

running a second query, Retrieve all pairs of products, each manufactured by a company headquartered in

San Jose , and looking up the probability of the pair adobe_indesign , adobe_dreamweaver in

the answer. Thus, while one single query does not convey information about the correlations between

the tuples in its answer, this information can always be obtained later, by asking additional, more

complex queries on the probabilistic database.

1.2.6 LINEAGE

The lineage of a possible output tuple to a query is a propositional formula over the input tuples in

the database, which says which input tuples must be present in order for the query to return that

output. Consider again the query “Retrieve all products manufactured by a company headquartered

in San Jose ” on the database in Figure 1.2 . The output tuple ( personal_computer , ibm ) has

lineage expression X 1 ∧ Y 1 , where X 1 and Y 1 represent the two input tuples shown in Figure 1.2 ;

this is because both X 1 and Y 1 must be in the database to ensure that output. For another example,

consider the query find all cities that are headquarters of some companies : the answer san_jose has

lineage Y 1 ∨ Y 2 because any of Y 1 or Y 2 are sufficient to produce the answer san_jose . Query

evaluation on probabilistic databases essentially reduces to the problem of computing the probability

of propositional formulas, representing lineage expressions. We discuss this in detail in Chapter 2 .

We note that the term “lineage” is sometimes used in the literature with slightly different and

not always consistent meanings. In this topic, we will use the term lineage to denote a propositional

formula. It corresponds to the PosBool provenance semiring of Green et al. [ 2007 ], which is the

semiring of positive Boolean expressions, except that we also allow negation.

4 The

covariance

two

random

variables X 1 ,X 2

is cov(X 1 ,X 2 ) = E [ (X 1 − μ X 1 ) · (X 2 − μ X 2 ) ]

. In

our

case, X i ∈

{ 0 , 1 }

is the random variable representing the absence/presence of the tuple t i , μ X i = E [ X i ]= p i ; hence, the co-

variance is E [ (X 1 − μ X 1 ) · (X 2 − μ X 2 ) ]= E [ X 1 · X 2 ]− E [ X 1 ]· p 2 − p 1 · E [ X 2 ]+ p 1 · p 2 = E [ X 1 · X 2 ]− p 1 · p 2 − p 1 ·

p 2 + p 1 · p 2 = E [ X 1 · X 2 ]− p 1 · p 2 = P(t 1 ∈ Q(W ) ∧ t 2 ∈ Q(W )) − P(t 1 ∈ Q(W )) · P(t 2 ∈ Q(W )) .

Probabilistic Databases

Search WWH ::

Custom Search

Home