Overview - Probabilistic Databases

Database Reference

In-Depth Information

1. OVERVIEW

compute the output probabilities correctly for any input database, then it is called a safe query plan .

Safe plans are easily added 6 to a relational database engine, either by small modifications of the

relational operators or even without any change in the engine by simply rewriting the SQL query

to manipulate the probabilities explicitly.

If a query admits a safe plan, then its data complexity is in polynomial time because any safe

plan can be computed in polynomial time in the size of the input database by simply evaluating its

operators bottom-up. Not all queries admit safe plans; as we will show in Chapter 3 , for specific

queries, we can prove that their data complexity is hard for # P , and these obviously will not have a

safe plan (unless P

= # P ). If a query admits a safe query plan, then it is called a safe query ; otherwise,

it is called unsafe .

The notion of query safety should be thought of as a syntactic notion: we are given a set of

rules for generating a safe plan for a query, and, if these rules succeed, then the query is called safe;

if the rules fail, then we call the query unsafe. We describe a concrete set of such rules in Chapter 4 .

The question is whether these rules are complete: if the query can be computed in polynomial time

by some algorithm, will we also find a safe plan for it? We show in Chapter 4 that the answer is

yes if one restricts queries to unions of conjunctive queries and the databases to tuple-independent

probabilistic databases. In this case, we have a dichotomy: for every query, either its data complexity

is in polynomial time (when the query is safe) or is provably hard for # P (when it is unsafe).

The terms safe query and safe query plan were introduced in the MystiQ project by

Dalvi and Suciu [ 2004 ].

1.3

APPLICATIONS OF PROBABILISTIC DATABASES

In recent years, there has been an increased interest in probabilistic databases. The main reason for

this has been the realization that many diverse applications need a generic platform for managing

probabilistic data; while the focus of this topic is on techniques for managing probabilistic databases,

we describe next some of these applications accompanied by an extensive list of references for further

reading.

Information extraction (IE), already mentioned in this chapter, is a very natural application for

probabilistic databases because some important IE techniques already generate probabilistic data.

For example, Conditional Random Fields (CRFs) [ Lafferty et al. , 2001 ] define a probability space

over the possible ways to parse a text. Typically, IE systems retain the most probable extraction,

but Gupta and Sarawagi [ 2006 ] show that by storing multiple (or even all) alternative extractions

of a CRF in a probabilistic database, one can increase significantly the overall recall of the system,

thus justifying the need for a probabilistic database. Wang et al. [ 2008a ], Wang et al. [ 2010b ], and

Wang et al. [ 2010a ] describe a system, BayesStore, which stores the CRF in a relational database

system and pushes the probabilistic inference inside the engine. Wick et al. [ 2010 ] describe an

6 One should be warned, however, that the requirement of the plan to be safe severely restricts the options of a query optimizer,

which makes the engineering aspects of integrating safe plans into a relational engine much more challenging than they seem at

the conceptual level.

Search WWH ::

Custom Search

Home