6. ADVANCED TECHNIQUES
Directions in Sequential Probabilistic Databases
There are other statistical models, notably chain CRFs [Lafferty et al., 2001], that produce output
that can be modeled by Markov Sequences (in spite of requiring different techniques for inference).
This opens up even more applications for Markov Sequence databases, notably more sophisticated
information extraction [Lafferty et al., 2001, Sha and Pereira, 2003].
6.3 MONTE CARLO DATABASES
Some applications require a very general probabilistic model that cannot be decomposed into
disjoint-independent blocks. An important class of such applications is financial risk assessment
systems. This has led to a more complex model of probabilistic databases, where certain attributes,
tuples, or regions of the database are associated with random variables that may have complex
continuous or discrete distributions. Given the rich probabilistic space, the tractable techniques for
query evaluation that we discussed in Chapter 4 and Chapter 5 no longer apply. The only approach
known to date for evaluating queries over probabilistic databases with rich probabilistic models is to
use Monte Carlo simulations throughout query processing. Hence their name: Monte Carlo Databases,
or MCDB. These databases were introduced by Jampani et al. [2008] and further discussed by
Xu et al. [2009], Arumugam et al. [2010], and Kennedy and Koch [2010].
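To make the idea of evaluating a query by simulation concrete, the following is a minimal Python sketch, not MCDB's actual execution strategy. It repeatedly samples a possible world, runs the same query in each world, and summarizes the resulting answers. The position names, the loss parameters, the threshold of 500, and the helper names `sample_world`, `query`, and `monte_carlo` are all hypothetical and chosen only for illustration.

```python
import random
import statistics

# Hypothetical uncertain attributes: each position's loss is assumed to be
# normally distributed with the given (mean, standard deviation).
POSITIONS = {"p1": (100.0, 20.0), "p2": (250.0, 50.0), "p3": (80.0, 10.0)}


def sample_world(rng):
    """Draw one possible world: one concrete loss value per position."""
    return {pid: rng.gauss(mu, sigma) for pid, (mu, sigma) in POSITIONS.items()}


def query(world):
    """The query evaluated in every world: SUM(loss) over all positions."""
    return sum(world.values())


def monte_carlo(n_worlds, seed=0):
    """Evaluate the query in n_worlds independently sampled worlds."""
    rng = random.Random(seed)  # seeded so that the run is reproducible
    return [query(sample_world(rng)) for _ in range(n_worlds)]


answers = monte_carlo(10_000)
mean_total = statistics.fmean(answers)                        # estimate of E[total loss]
p_exceed = sum(a > 500.0 for a in answers) / len(answers)     # estimate of P(total loss > 500)
print(f"estimated mean total loss: {mean_total:.1f}")
print(f"estimated P(total loss > 500): {p_exceed:.3f}")
```

The answer to the query is itself a random variable, so the output is an empirical distribution over answers rather than a single value; the accuracy of any statistic derived from it improves as the number of sampled worlds grows.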
6.3.1 THE MCDB DATA MODEL
MCDBs represent a rich data model by combining two simple primitives: a large, predefined set
of random variables, and SQL queries (views). MCDBs do not encode uncertainty in the data
itself but allow the user to define arbitrary variable generation functions, VG, which are
pseudo-random generators for any random variable. The semantics of MCDB is the standard possible
worlds semantics for probabilistic databases. A relation is called deterministic if its realization is
the same in all possible worlds; otherwise, it is called probabilistic or random.
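The distinction between deterministic and random relations can be illustrated with a small Python sketch. It is only an analogy under stated assumptions: `normal_vg` stands in for a user-defined VG function, each seed stands for one possible world, and the relations `Items` and `PricedItems` are hypothetical names that do not come from MCDB.

```python
import random


def normal_vg(mean, stdv, rng):
    """Hypothetical VG function: an endless stream of pseudo-random normal samples."""
    while True:
        yield rng.gauss(mean, stdv)


# A deterministic relation: the same tuples in every possible world.
ITEMS = [(1, "bolt"), (2, "nut"), (3, "gear")]


def realize(seed):
    """Build one possible world; the seed fixes every pseudo-random choice."""
    rng = random.Random(seed)
    vg = normal_vg(5.0, 1.0, rng)
    # A random relation: each tuple is extended with a value drawn from the VG.
    priced = [(item_id, name, next(vg)) for item_id, name in ITEMS]
    return {"Items": list(ITEMS), "PricedItems": priced}


def is_deterministic(relation, worlds):
    """True if the relation has the same realization in all the given worlds."""
    first = worlds[0][relation]
    return all(world[relation] == first for world in worlds)


worlds = [realize(seed) for seed in range(5)]
print(is_deterministic("Items", worlds))        # same tuples in every sampled world
print(is_deterministic("PricedItems", worlds))  # values differ from world to world
```

Note that a finite sample of worlds can only refute determinism, never prove it; the check here agrees with the definition because `Items` is constructed without any call to the VG.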
All the following examples are adapted from Jampani et al. [2008]; we allow some syntactic
variations for presentation purposes.
Consider a deterministic table Customer(cid, name, region, gender, age). We would like to
store each customer's income, which is unknown. In our first example, we will model the income
as normally distributed, with mean 10000 and standard deviation 2000:
CREATE TABLE CustIncome(cid, name, region, gender, age, income)
FOR EACH d IN Customer
SELECT d.cid, d.name, d.region, d.gender, d.age, x.value
FROM Normal(10000,2000) x
This SQL statement is in essence a view definition. Starting from the deterministic table
Customer, it constructs a probabilistic table CustIncome having one additional attribute, the
customer's income. Normal(mean,stdv) is a variable generating function (VG) that generates samples
from a normal distribution with a given mean and standard deviation. All VG's in MCDB are implemented using