Database Reference
In-Depth Information
2. DATA AND QUERY MODEL
ProducesProduct
Company
Product
P
sony
walkman
0 . 96
ibm
personal_computer
0 . 96
adobe
adobe_dreamweaver
0 . 87
Figure 2.6: A tuple-independent table, which is a fragment of ProducesProduct in Figure 1.2 .Ina
tuple-independent table, we only need to indicate the marginal tuple probabilities.
table. For a simple example, any deterministic table is a tuple-independent table. A tuple-independent
table can always be represented by a pc-table whose tuples t 1 ,t 2 ,t 3 ,... are annotated with distinct
Boolean variables X 1 ,X 2 ,X 3 ,... . Since each variable X i occurs only once, we don't need to store it
at all; instead, in a tuple-independent table, we store the probability p i = P(X i ) next to each tuple t i .
Thus, the schema of a tuple independent table is R(A 1 ,A 2 ,...,A m ,P) , where A 1 ,A 2 ,...,A m are
the regular attributes, and P is the tuple probability. Of course, a query cannot access P directly, so in
a query, the relation R will appear with the schema R(A 1 ,...,A m ) . Alternatively, we view a tuple-
independent table as a relation R(A 1 ,...,A m ) and a probability function P mapping tuples t R
to probabilities P(t) . With this convention, we denote a tuple-independent probabilistic database
as D
= ( R 1 ,...,R k ,P) .
Figure 2.6 shows a tuple-independent table called ProducesProduct. The marginal tuple prob-
abilities are in the right column; this is the same convention we used in Figure 1.2 . In this simple
example, there are 8 possible worlds, corresponding to the subsets of the table. The probability of
the world consisting of the first and the third tuple is 0 . 96
0 . 87.
Tuple-independent tables are good building blocks since there are no correlations and no
constraints between the tuples. However, they are obviously not complete since they can only rep-
resent probabilistic databases where all tuples are independent events: for a simple counterexample,
Figure 2.2 is not tuple-independent. However, more complex probabilistic databases can sometimes
be decomposed into tuple independent tables, and thus “normalized”; we illustrate with an example.
·
0 . 04
·
Example 2.15 Consider again the NELL database. Each fact is extracted from a Webpage, called
source . Some sources are more reliable and contain accurate facts, while other sources are less reliable
and contain often incorrect facts. We want to express the fact that tuples in the probabilistic database
are correlated with their source. This introduces additional correlations. For example, suppose two
tuples ProducesProduct(a, b) and ProducesProduct(c, d) are extracted from the same source. If the first
tuple is wrong, then it is wrong either because the source is wrong or because the extraction was wrong:
in the first case, the second tuple is likely to be wrong, too. Thus, if one tuple is wrong, the probability
that the other tuple is also wrong increases. For the same reason, there are now correlations between
tuples in different tables, if they come from the same source. While it is possible to represent this with
pc-tables, since pc-tables are a complete representation system, a better approach is to decompose
the data into two tuple-independent tables with the following schemas:
 
Search WWH ::




Custom Search