Database Reference
1.2 KEY CONCEPTS
1.2.1 PROBABILITIES AND THEIR MEANING IN DATABASES
How I stopped worrying and started to love probabilities.³
Where do the probabilities in a probabilistic database come from? And what exactly do
they mean? The answer to these questions may differ from application to application, but it is
rarely satisfactory. Information extraction systems are based on probabilistic models, so the data they
extract is probabilistic [Gupta and Sarawagi, 2006, Lafferty et al., 2001]; RFID readings are cleaned
using particle filters that also produce probability distributions [Ré et al., 2008]; data analytics in
financial prediction rely on statistical models that often generate probabilistic data [Jampani et al.,
2008]. In some cases, the probability values have a precise semantics, but that semantics is often
associated with the way the data is derived and not necessarily with how the data will be used. In
other cases we have no probabilistic semantics at all but only a subjective confidence level that needs
to be converted into a probability: for example, Google Squared does not even associate numerical
scores, but defines a fixed number of confidence levels (high, low, etc.), which need to be converted
into a probabilistic score in order to be merged with other data and queried. Another example is
BioRank [Detwiler et al., 2009], which takes subjective and relative weights of evidence as input and
converts them into probabilistic weights in order to compute relevance scores that rank the most likely
functions for proteins.
No matter how they were derived, we always map a confidence score to the interval [0, 1] and
interpret it as a probability value. The important invariant is that a larger value always represents a
higher degree of confidence, and this carries over to the query output: answers with a higher (computed)
probability are more credible than answers with a lower probability. Typically, a probabilistic
database ranks the answers to a query by their probabilities: the ranking is often more informative
than the absolute values of their probabilities.
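As a minimal sketch of the mapping and ranking just described, the following Python fragment converts discrete confidence levels and raw numeric scores into values in [0, 1] and then ranks answers by probability. The names `LEVEL_TO_PROB`, `to_probability`, and `rank_answers`, as well as the numeric values assigned to each level, are hypothetical choices made for illustration; they are not taken from Google Squared, BioRank, or any other system mentioned above.

```python
# Hypothetical mapping from discrete confidence levels to probabilities.
# The numeric values are illustrative assumptions, not taken from a real system.
LEVEL_TO_PROB = {"low": 0.3, "medium": 0.6, "high": 0.9}


def to_probability(score, lo=None, hi=None):
    """Map a confidence label or a raw numeric score into [0, 1]."""
    if isinstance(score, str):
        return LEVEL_TO_PROB[score]
    # Min-max rescaling is monotone: a larger score stays a larger probability.
    return (score - lo) / (hi - lo)


def rank_answers(answers):
    """Sort (answer, probability) pairs by decreasing probability."""
    return sorted(answers, key=lambda pair: pair[1], reverse=True)


answers = [
    ("a1", to_probability("low")),
    ("a2", to_probability("high")),
    ("a3", to_probability(7.0, lo=0.0, hi=10.0)),
]
ranked = rank_answers(answers)
print([name for name, _ in ranked])  # ['a2', 'a3', 'a1']
```

Any monotone mapping would preserve the invariant that a larger value means higher confidence; min-max rescaling is simply one convenient choice for this sketch.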
1.2.2 POSSIBLE WORLDS SEMANTICS
The meaning of a probabilistic database is surprisingly simple: it means that the database instance
can be in one of several states, and each state has a probability. That is, we are not given a single
database instance but several possible instances, and each has some probability. For example, in the
case of NELL, the content of the database can be any subset of the 537K tuples. We don't know
which ones are correct and which ones are wrong. Each subset of tuples is called a possible world and
has a probability: the sum of probabilities of all possible worlds is 1.0. Similarly, for a database where
the uncertainty is at the attribute level, a possible world is obtained by choosing a possible value for
each uncertain attribute, in each tuple.
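To make the tuple-level case concrete, the sketch below enumerates every possible world of a three-tuple toy database and checks that the world probabilities sum to 1.0. It assumes that tuples are independent of one another, which is an assumption introduced here to keep the example small and is not implied by the definition above; the tuple names and their probabilities are likewise made up for illustration.

```python
from itertools import product

# Toy database: hypothetical tuples with made-up probabilities of being correct.
tuples = {"t1": 0.9, "t2": 0.5, "t3": 0.2}


def possible_worlds(tuples):
    """Yield (world, probability) for every subset of the tuples.

    ASSUMPTION for this sketch: tuples are independent, so the probability
    of a world is the product of p for each tuple present and (1 - p) for
    each tuple absent.
    """
    names = list(tuples)
    for mask in product([False, True], repeat=len(names)):
        world = frozenset(n for n, keep in zip(names, mask) if keep)
        prob = 1.0
        for n, keep in zip(names, mask):
            prob *= tuples[n] if keep else 1.0 - tuples[n]
        yield world, prob


worlds = list(possible_worlds(tuples))
total = sum(p for _, p in worlds)
print(len(worlds))       # 8, i.e. 2 ** 3 subsets
print(round(total, 10))  # 1.0
```

With n tuples the enumeration produces 2^n worlds, so it is only feasible for very small n.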
Thus, a probabilistic database is simply a probability distribution over a set of possible worlds.
While the number of possible worlds is astronomical, e.g., 2^537000 possible worlds for NELL, this
³ One of the coauthors.