Overview - Probabilistic Databases


1.2 KEY CONCEPTS

1.2.1 PROBABILITIES AND THEIR MEANING IN DATABASES

How I stopped worrying and started to love probabilities.³

Where do the probabilities in a probabilistic database come from? And what exactly do
they mean? The answer to these questions may differ from application to application, but it is
rarely satisfactory. Information extraction systems are based on probabilistic models, so the data they
extract is probabilistic [Gupta and Sarawagi, 2006, Lafferty et al., 2001]; RFID readings are cleaned
using particle filters that also produce probability distributions [Ré et al., 2008]; data analytics in
financial prediction relies on statistical models that often generate probabilistic data [Jampani et al.,
2008]. In some cases, the probability values have a precise semantics, but that semantics is often
tied to the way the data is derived and not necessarily to how the data will be used. In
other cases we have no probabilistic semantics at all but only a subjective confidence level that needs
to be converted into a probability: for example, Google Squared does not even associate numerical
scores with its data, but defines a fixed number of confidence levels (high, low, etc.), which need to be
converted into a probabilistic score in order to be merged with other data and queried. Another example is
BioRank [Detwiler et al., 2009], which takes as input subjective and relative weights of evidence and
converts those into probabilistic weights in order to compute relevance scores that rank the most likely
functions for proteins.

No matter how it was derived, we always map a confidence score to the interval [0, 1] and
interpret it as a probability value. The important invariant is that a larger value always represents a
higher degree of confidence, and this carries over to the query output: answers with a higher (computed)
probability are more credible than answers with a lower probability. Typically, a probabilistic
database ranks the answers to a query by their probabilities: the ranking is often more informative
than the absolute values of their probabilities.
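The mapping and ranking described above can be illustrated with a minimal Python sketch. The label-to-probability table, the function names, and the sample answers are all hypothetical, chosen only for illustration; the one property the sketch is meant to preserve is that a higher confidence never maps to a lower probability.

```python
# Hypothetical mapping from discrete confidence labels to probabilities.
# The numeric values are illustrative assumptions, not taken from any real system.
LEVEL_TO_PROB = {"low": 0.3, "medium": 0.6, "high": 0.9}


def to_probability(score):
    """Map a confidence label or a numeric score to a value in [0, 1]."""
    if isinstance(score, str):
        return LEVEL_TO_PROB[score.lower()]
    # Numeric scores are clamped so the result always lies in [0, 1].
    return min(1.0, max(0.0, float(score)))


def rank_answers(answers):
    """Sort (answer, confidence) pairs by decreasing probability."""
    scored = [(answer, to_probability(conf)) for answer, conf in answers]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Mixed inputs: two labels and one numeric score.
ranked = rank_answers([("a1", "low"), ("a2", "high"), ("a3", 0.75)])
print(ranked)
```

Because only the order of the scores matters for ranking, any monotone mapping of the labels would produce the same ranking among the labelled answers here.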

1.2.2 POSSIBLE WORLDS SEMANTICS

The meaning of a probabilistic database is surprisingly simple: it means that the database instance
can be in one of several states, and each state has a probability. That is, we are not given a single
database instance but several possible instances, and each has some probability. For example, in the
case of NELL, the content of the database can be any subset of the 537K tuples. We don't know
which ones are correct and which ones are wrong. Each subset of tuples is called a possible world and
has a probability: the sum of probabilities of all possible worlds is 1.0. Similarly, for a database where
the uncertainty is at the attribute level, a possible world is obtained by choosing a possible value for
each uncertain attribute, in each tuple.
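To make the tuple-level case concrete, the following Python sketch enumerates every possible world of a tiny three-tuple database and checks that the world probabilities sum to 1.0. It assumes, purely for illustration, that tuples are independent, so a world's probability is a product over tuples; the text above does not require this, and the tuple names and probabilities are hypothetical.

```python
from itertools import combinations

# Hypothetical tuples with marginal probabilities (illustrative values only).
tuples = {"t1": 0.9, "t2": 0.5, "t3": 0.2}


def possible_worlds(tuple_probs):
    """Yield (world, probability) for every subset of the tuples.

    Assumption: tuples are independent, so a world's probability is the
    product of p for each tuple it contains and (1 - p) for each it omits.
    """
    names = list(tuple_probs)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            world = frozenset(subset)
            prob = 1.0
            for name, p in tuple_probs.items():
                prob *= p if name in world else 1.0 - p
            yield world, prob


worlds = dict(possible_worlds(tuples))
total = sum(worlds.values())

# Marginal probability of t1, recovered by summing over the worlds containing it.
p_t1 = sum(p for world, p in worlds.items() if "t1" in world)

print(len(worlds))  # number of subsets of 3 tuples
print(total, p_t1)
```

Explicit enumeration is only feasible for toy inputs, since the number of worlds doubles with every added tuple.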

Thus, a probabilistic database is simply a probability distribution over a set of possible worlds.
While the number of possible worlds is astronomical, e.g., 2^537000 possible worlds for NELL, this

³ One of the coauthors.
