Database Reference
In-Depth Information
6. ADVANCED TECHNIQUES
all, even very general, probabilistic models. The disadvantage is that Monte Carlo simulations, at
least when evaluated naïvely, are costly.
The semantics of a query Q in an MCDB is given by a repeated execution of Q on sample
databases: the query is evaluated N times, over N randomly chosen worlds.The overall result depends
on the way the possible worlds semantics is closed in the query. For example, if probabilities of tuples
are computed, this probability is estimated by returning, for a possible result tuple (i.e., a tuple present
in the query result on at least one sample database), the ratio M/N , where M is the number of sample
result relations in which the tuple occurs. MCDBs also allow to compute a wide variety of statistical
tests and aggregates on the samples, beyond tuple probabilities. For example, consider the query:
SELECT age, sum(income), count(*)
FROM CustIncome
WHERE income > 12000
GROUP BY age
The query computes for each age bracket the sum of all incomes of all customers earning more
than 12000, and their number. When the query finishes one run, its answer consists of a set of tuples
t 1 ,t 2 ,... over all N runs. The MCDB collects all tuples, and computes a set of pairs (t i ,f i ) , where t i
is a possible tuple and f i is the frequency of that tuple over the N runs. It then returns a set of tuples
( age i , sum i , count i ,f i ) . This result can be used in many versatile ways. For example, the expected
value of the sum for each age can be obtained as E
[ sum ]= i sum i ·
f i . For large enough N , the
σ N / N where
1 ) i ( sum i E [ sum ] ) 2 f i ,
by the central limit theorem. Thus, by returning multiple sample answers as opposed to a single
aggregate value, the system provides much more utility.
To compute a set of sample databases and evaluate a query on each of them, it is, of course, not
necessary to materialize all possible worlds - since in general, infinite and even continuous probability
spaces are considered in MCDBs, this would not even be theoretically possible. Nevertheless, a naïve
evaluation following the sampling-based semantics of MCDBs is not practical because the parameter
N needed to obtain results of good quality may be very large. Thus, an MCDB has to use a number
of optimization techniques to alleviate this high cost. In the MCDB system of Jampani et al. [ 2008 ],
the following ideas are used to achieve this:
￿ Every query Q runs only once, but it returns tuple bundles instead of single tuples. A tuple
bundle is an array of tuples with the same schema t [ 1 ] ,t [ 2 ] ,... Tuple t [ i ]
σ N
+ /
= N/(N
accuracy of this estimator is
1 . 96
corresponds to the
i -th possible world in the Monte Carlo simulation, where i =
1 ,N . This allows the system
and t [ j ]
to check easily if two tuples belong to the same world: t [ i ]
belong to the same world
iff i
=
j .
￿ The materialization of a random attribute is delayed as long as possible. For example, if income
is not directly inspected by the query, then the attribute is not expanded.
￿ The values of the random variables are reproducible. For that, the seed used to generate that
random variable is stored and reused when it is needed again.
Search WWH ::




Custom Search