Data andQuery Model - Probabilistic Databases

Database Reference

In-Depth Information

2. DATA AND QUERY MODEL

ProducesProduct

Company

Product

sony

walkman

0 . 96

ibm

personal_computer

0 . 96

adobe

adobe_dreamweaver

0 . 87

Figure 2.6: A tuple-independent table, which is a fragment of ProducesProduct in Figure 1.2 .Ina

tuple-independent table, we only need to indicate the marginal tuple probabilities.

table. For a simple example, any deterministic table is a tuple-independent table. A tuple-independent

table can always be represented by a pc-table whose tuples t 1 ,t 2 ,t 3 ,... are annotated with distinct

Boolean variables X 1 ,X 2 ,X 3 ,... . Since each variable X i occurs only once, we don't need to store it

at all; instead, in a tuple-independent table, we store the probability p i = P(X i ) next to each tuple t i .

Thus, the schema of a tuple independent table is R(A 1 ,A 2 ,...,A m ,P) , where A 1 ,A 2 ,...,A m are

the regular attributes, and P is the tuple probability. Of course, a query cannot access P directly, so in

a query, the relation R will appear with the schema R(A 1 ,...,A m ) . Alternatively, we view a tuple-

independent table as a relation R(A 1 ,...,A m ) and a probability function P mapping tuples t ∈ R

to probabilities P(t) . With this convention, we denote a tuple-independent probabilistic database

as D

= ( R 1 ,...,R k ,P) .

Figure 2.6 shows a tuple-independent table called ProducesProduct. The marginal tuple prob-

abilities are in the right column; this is the same convention we used in Figure 1.2 . In this simple

example, there are 8 possible worlds, corresponding to the subsets of the table. The probability of

the world consisting of the first and the third tuple is 0 . 96

0 . 87.

Tuple-independent tables are good building blocks since there are no correlations and no

constraints between the tuples. However, they are obviously not complete since they can only rep-

resent probabilistic databases where all tuples are independent events: for a simple counterexample,

Figure 2.2 is not tuple-independent. However, more complex probabilistic databases can sometimes

be decomposed into tuple independent tables, and thus “normalized”; we illustrate with an example.

0 . 04

Example 2.15 Consider again the NELL database. Each fact is extracted from a Webpage, called

source . Some sources are more reliable and contain accurate facts, while other sources are less reliable

and contain often incorrect facts. We want to express the fact that tuples in the probabilistic database

are correlated with their source. This introduces additional correlations. For example, suppose two

tuples ProducesProduct(a, b) and ProducesProduct(c, d) are extracted from the same source. If the first

tuple is wrong, then it is wrong either because the source is wrong or because the extraction was wrong:

in the first case, the second tuple is likely to be wrong, too. Thus, if one tuple is wrong, the probability

that the other tuple is also wrong increases. For the same reason, there are now correlations between

tuples in different tables, if they come from the same source. While it is possible to represent this with

pc-tables, since pc-tables are a complete representation system, a better approach is to decompose

the data into two tuple-independent tables with the following schemas:

Probabilistic Databases

Search WWH ::

Custom Search

Home