6. ADVANCED TECHNIQUES
Directions in Sequential Probabilistic Databases. There are other statistical models, notably chain
CRFs [Lafferty et al., 2001], whose output can be modeled by Markov Sequences
(even though they require different techniques for inference). This opens up even more applications
for Markov Sequence databases, notably more sophisticated information extraction [Lafferty et al.,
2001, Sha and Pereira, 2003].
6.3 MONTE CARLO DATABASES
Some applications require a very general probabilistic model, one that cannot be decomposed
into disjoint-independent blocks. An important class of such applications is financial risk assessment
systems. This has led to a more complex model of probabilistic databases, where certain attributes,
tuples, or regions of the database are associated with random variables that may have complex
continuous or discrete distributions. Given the rich probabilistic space, the tractable techniques for
query evaluation that we discussed in Chapter 4 and Chapter 5 no longer apply. The only approach
known to date for evaluating queries over probabilistic databases with rich probabilistic models is to use
Monte Carlo simulations throughout query processing. Hence their name: Monte Carlo Databases,
or MCDB. These databases were introduced by Jampani et al. [2008] and further discussed by
Xu et al. [2009], Arumugam et al. [2010], and Kennedy and Koch [2010].
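To make the idea of Monte Carlo query evaluation concrete, the following is a minimal Python sketch, not the MCDB implementation. It samples many possible worlds, runs the same aggregate query on each, and summarizes the resulting answers. The table POSITIONS, the lognormal loss model, the threshold 80.0, and all function names are assumptions made up for illustration.

```python
import random
import statistics

# Hypothetical deterministic table: (position_id, exposure).
POSITIONS = [(1, 100.0), (2, 250.0), (3, 75.0)]


def sample_world(rng):
    """Draw one possible world by attaching a random loss to each position.

    The loss model (exposure times a lognormal factor) is an illustrative
    assumption, not something taken from the text.
    """
    return [(pid, exposure * rng.lognormvariate(-2.0, 0.5))
            for pid, exposure in POSITIONS]


def total_loss(world):
    """The query, roughly SELECT SUM(loss) FROM world."""
    return sum(loss for _, loss in world)


def monte_carlo(query, n_worlds, seed=0):
    """Evaluate `query` on n_worlds sampled worlds and return every answer."""
    rng = random.Random(seed)  # fixed seed makes the simulation repeatable
    return [query(sample_world(rng)) for _ in range(n_worlds)]


answers = monte_carlo(total_loss, n_worlds=10_000)

# Summaries of the empirical answer distribution.
expected_loss = statistics.fmean(answers)
tail_probability = sum(a > 80.0 for a in answers) / len(answers)
```

The answer to a query here is a sample from a distribution rather than a single value, so quantities such as the mean or a tail probability are estimates whose accuracy depends on the number of sampled worlds.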
6.3.1 THE MCDB DATA MODEL
MCDBs represent a rich data model by combining two simple primitives: a large, predefined set
of random variables, and SQL queries (views). MCDBs do not encode uncertainty in the data
itself but allow the user to define arbitrary variable generation functions, VG, which are
pseudo-random generators for any random variable. The semantics of MCDB is the standard possible
worlds semantics for probabilistic databases. A relation is called deterministic if its realization is the
same in all possible worlds; otherwise, it is called probabilistic or random.
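The distinction between deterministic and random relations can be sketched in Python as follows. This is an illustration under stated assumptions rather than MCDB's actual interface: a VG function is modeled as a callable that draws one value from a pseudo-random source, a relation is a function from that source to a list of tuples, and the table ITEMS and all names are hypothetical.

```python
import random


def normal_vg(mean, stdv):
    """Build a hypothetical VG function that draws one normal value per call."""
    def generate(rng):
        return rng.gauss(mean, stdv)
    return generate


# Deterministic base table, made up for illustration: (item, list_price).
ITEMS = [("a", 10.0), ("b", 20.0)]


def items_view(rng):
    """Deterministic relation: it never consults the random source."""
    return list(ITEMS)


def discounted_view(rng, vg=normal_vg(0.9, 0.05)):
    """Random relation: one VG draw per tuple scales the price."""
    return [(item, price * vg(rng)) for item, price in ITEMS]


def realize(relation, n_worlds, seed=0):
    """Realize a relation in n_worlds sampled possible worlds."""
    rng = random.Random(seed)
    return [relation(rng) for _ in range(n_worlds)]


def looks_deterministic(relation, n_worlds=100):
    """Return True if all sampled realizations coincide.

    This only inspects a finite sample of worlds, so it can refute
    determinism but cannot prove it.
    """
    worlds = realize(relation, n_worlds)
    return all(world == worlds[0] for world in worlds)
```

Calling looks_deterministic(items_view) and looks_deterministic(discounted_view) contrasts the two cases: the first relation has the same realization in every sampled world, while the second varies from world to world.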
All the following examples are adapted from Jampani et al. [2008]; we allowed some syntactic
variations for presentation purposes.
Consider a deterministic table Customer(cid, name, region, gender, age). We
would like to store each customer's income, which is unknown. In our first example, we will
describe the income as a normal distribution, with mean 10000 and standard deviation 2000:
CREATE TABLE CustIncome(cid, name, region, gender, age, income)
FOR EACH d IN Customer
SELECT d.cid, d.name, d.region, d.gender, d.age, x.value
FROM Normal(10000,2000) x
This SQL statement is in essence a view definition. Starting from the deterministic table
Customer, it constructs a probabilistic table CustIncome having one additional attribute, the
customer's income. Normal(mean,stdv) is a variable generating function (VG) that generates normally
distributed values with a given mean and standard deviation. All VGs in MCDB are implemented using