Database Reference
In-Depth Information
1. OVERVIEW
compute the output probabilities correctly for any input database, then it is called a safe query plan .
Safe plans are easily added 6 to a relational database engine, either by small modifications of the
relational operators or even without any change in the engine by simply rewriting the SQL query
to manipulate the probabilities explicitly.
If a query admits a safe plan, then its data complexity is in polynomial time because any safe
plan can be computed in polynomial time in the size of the input database by simply evaluating its
operators bottom-up. Not all queries admit safe plans; as we will show in Chapter 3 , for specific
queries, we can prove that their data complexity is hard for # P , and these obviously will not have a
safe plan (unless P
= # P ). If a query admits a safe query plan, then it is called a safe query ; otherwise,
it is called unsafe .
The notion of query safety should be thought of as a syntactic notion: we are given a set of
rules for generating a safe plan for a query, and, if these rules succeed, then the query is called safe;
if the rules fail, then we call the query unsafe. We describe a concrete set of such rules in Chapter 4 .
The question is whether these rules are complete: if the query can be computed in polynomial time
by some algorithm, will we also find a safe plan for it? We show in Chapter 4 that the answer is
yes if one restricts queries to unions of conjunctive queries and the databases to tuple-independent
probabilistic databases. In this case, we have a dichotomy: for every query, either its data complexity
is in polynomial time (when the query is safe) or is provably hard for # P (when it is unsafe).
The terms safe query and safe query plan were introduced in the MystiQ project by
Dalvi and Suciu [ 2004 ].
1.3
APPLICATIONS OF PROBABILISTIC DATABASES
In recent years, there has been an increased interest in probabilistic databases. The main reason for
this has been the realization that many diverse applications need a generic platform for managing
probabilistic data; while the focus of this topic is on techniques for managing probabilistic databases,
we describe next some of these applications accompanied by an extensive list of references for further
reading.
Information extraction (IE), already mentioned in this chapter, is a very natural application for
probabilistic databases because some important IE techniques already generate probabilistic data.
For example, Conditional Random Fields (CRFs) [ Lafferty et al. , 2001 ] define a probability space
over the possible ways to parse a text. Typically, IE systems retain the most probable extraction,
but Gupta and Sarawagi [ 2006 ] show that by storing multiple (or even all) alternative extractions
of a CRF in a probabilistic database, one can increase significantly the overall recall of the system,
thus justifying the need for a probabilistic database. Wang et al. [ 2008a ], Wang et al. [ 2010b ], and
Wang et al. [ 2010a ] describe a system, BayesStore, which stores the CRF in a relational database
system and pushes the probabilistic inference inside the engine. Wick et al. [ 2010 ] describe an
6 One should be warned, however, that the requirement of the plan to be safe severely restricts the options of a query optimizer,
which makes the engineering aspects of integrating safe plans into a relational engine much more challenging than they seem at
the conceptual level.
Search WWH ::




Custom Search