Advanced Techniques - Probabilistic Databases

Database Reference

In-Depth Information

makes query evaluation on the view no more efficient than expanding the view definition in the

query.

To cope with this problem researchers have considered three main techniques. The first two

techniques exploit the common substructure in the formula. The first idea is a sophisticated form

of caching. The main idea is that the lineage formula may have common sub-components. Rather

than re-evaluating these sub-components, we can simply cache the results and avoid expensive re-

computation. Of course, identifying such sub-components is a difficult problem and is the main

technical challenge. This idea has been applied both to BID-style databases [ Sarmaetal. , 2008b ]

and to graphical model-based approaches [ Senetal. , 2009 ].

A second approach that exploits the regularity of the lineage formula is to approximate the

lineage formula. Here, the main idea is to replace the original Boolean lineage formula t for a tuple t

with a new Boolean formula ˜ t that is smaller (i.e., uses fewer variables). More precisely, given some

ε ≥ 0, we can choose a formula

[ ˜ t = ˜ t ]≤ ε , i.e., the disagreement probability is less

than ε . The size of ˜ t is a function only of ε , but the size can be bounded independently from the

size of the original lineage formula. Simply shrinking the lineage has can improve query processing

time: roughly speaking, the Luby-Karp-based algorithms of Section 5.3 that approximate P ()

take time roughly quadratic in the size of the lineage formula. A second advantage of this approach

is that the formula ˜ is syntactically identical to a conventional lineage formula and so requires

no additional machinery to process. Finding the smallest such ˜ t is a computationally challenging

problem. Nevertheless, one can show that even simple greedy solutions still find ˜ t that are hundreds

of times smaller. Using ideas from Boolean Harmonic Analysis, one can replace the functions ˜ t

with multi-linear polynomials which may allow even further compression. But, the resulting lineage

formula require new query processing algorithms.This approach is discussed by Ré and Suciu [ 2008 ].

A third (and more aggressive technique), discussed by Dalvi et al. [ 2011 ], is to simply throw

away the entire lineage formula - essentially assuming that the tuples in the view are a new relation.

Of course, using such a view naively may result in incorrect answers. However, some queries may

not be affected by this choice. A trivial example is that a query which asks for a marginal probability

of a single tuple can be trivially answered. But so can more sophisticated queries. Intuitively, if one

can identify that a query only touches tuples whose correlations are either independent or disjoint

in the view - avoiding those tuples that may have complex correlations - one can use the view to

process the query. Deducing this is non-trivial (the problem is 2 -Complete [ Dalvi et al. , 2011 ]).

Nonetheless, there are efficient sound (but not complete) heuristics that can decide, given a query

Q and a view V ,if V can be used to answer Q [ Dalvi et al. , 2011 ].

˜ t

so that P

Search WWH ::

Custom Search

Home