Extensional Query Evaluation - Probabilistic Databases

Database Reference

In-Depth Information

each atom. In any of the new queries Q r

created by the independent-project rule, the constant

a is not eligible for ranking because if it occurs on some position in one atom R(...,a,...) , then

every R -atom has a on that position.

[

a/x

]

4.1.2.7 General Remarks About the Rules

Every rule starts from a query Q and expresses P (Q) in terms of simpler queries P(Q 1 ) , P(Q 2 ),...

For each of the queries Q i , we apply another rule, and again, until we reach ground tuples, at which

point, we simply look up their probability in the database. This process succeeds only if all branches

of the rewriting end in ground tuples. If the rules succeed in computing the query, then we call it a safe

query ; otherwise, we call it an unsafe query . The rules are non-deterministic: for example, whenever

we can apply an independent join, we can also apply inclusion-exclusion. No matter in what order we

apply the rules; if they terminate, then the result P (Q) is guaranteed to be correct. We will illustrate

several examples of unsafe queries ( Subsection 4.1.3 ) and safe queries ( Subsection 4.1.4 ).

What is the data complexity of computing P (Q) using the rules? Five of the rewrite rules

do not mention the database at all: the independent-project rule is the only rule that depends on the

active domain of the database, and it increases by a factor of n

the number of queries

that need to be evaluated, while reducing by 1 the arity of all relational symbols; therefore, if the

maximum arity of any relational symbol in Q is k , then the data complexity of computing P (Q) is

O(n k ) . Thus, if we can evaluate P (Q) using these rules, then it has polynomial time data complexity.

For queries in the Relational Calculus (RC), safety is not a decidable property. This

follows from Trakthenbrot's theorem, which states that it is undecidable to check whether a

Boolean RC expression is satisfiable over finite models Libkin [ 2004 ]. For example, consider

H 0 =

ADom

R(x),S(x,y),T(y) , the query defined in Section 3.2 . Clearly, H 0 is unsafe (we will examine

it closer in the next section). Let Q be an arbitrary query in RC, which does not use the relational

symbols R, S, T ; note that Q may use negation. Then the query H 0 ∧

Q is safe iff Q is not satisfiable:

hence, safety is undecidable.

For UCQ queries, safety is decidable because the rules can be applied in a systematic way,

as follows. First, rank all attribute-constant pairs, then repeat the following sequence of steps. (1)

Convert Q from a DNF expression Q 1 ∨ Q 2 ∨ ... to a CNF expression Q = Q 1 ∧ Q 2 ∧ ... by

applying the distributivity law. Each query Q i is called a disjunctive query (it is a disjunction of

connected conjunctive queries, see Figure 4.2 ). Apply the independent-join rule ( Eq. (4.1) ), P (Q) =

P(Q i 11 ∧

, if possible. (2) Apply the inclusion-exclusion formula,

Eq. (4.7) , to the remaining conjunctions. This results in several disjunctive queries Q j 1 ∨ Q j 2 ∨ ... :

the number of such disjunctive queries is exponential in the size of the query (of course, it is

independent on the size of the database). (3) Apply independent union Eq. (4.5) , if at all possible (4)

For each remaining disjunction Q

Q i 12 ...)

P(Q i 21 ∧

Q i 22 ...)

···

... , use Eq. (4.3) to choose a separator variable, x ,

then apply independent project Eq. (4.2) , to obtain new queries of the form Q [ a/x ]

Q j 1 ∨

Q j 2 ∨

; if no separator

variable exists, apply the attribute-attribute ranking rule for some pairs of attributes, and search again

for a separator variable. (5) Each Q [ a/x ]

is expressed in DNF; repeat from step (1). If we ever get

Probabilistic Databases

Search WWH ::

Custom Search

Home