Database Reference
In-Depth Information
where N is the number of tuples and rule-tuples in T
(
t
)
, and F is the cumulative
distribution function of the binomial distribution.
Based on Theorems 5.14 and 5.15, we derive the following bound for the top-
k
probability of tuple
t
∈
T
.
Theorem 5.16 (Bounds of top-
k
probabilities).
For a tuple t
∈
T , the top-k proba-
bility of t satisfies
N
)
Pr
k
1. Pr
(
t
)
F
(
k
−
1;
N
,
p
max
)
≤
(
t
)
≤
Pr
(
t
)
F
(
k
−
1;
N
,
for
1
≤
k
≤
μ
;
,
N
)
≤
Pr
k
2. Pr
(
t
)
F
(
k
−
1;
N
(
t
)
≤
Pr
(
t
)
F
(
k
−
1;
N
,
p
min
)
for
μ +
1
≤
k
≤
N
+
1
.
Proof.
The theorem holds following from Theorems 5.14 and 5.15.
Example 5.9 (Bounding the probability). Consider the uncertain tuples in Table 2.2
again. T
(
t
3
)=
{
t
1
,
t
2
}
. The expected number of tuples in T
(
t
3
)
is
μ =
Pr
(
t
1
)+
Pr
(
is
2
.
Consider the top-
2
probability of t
3
. Since
2
t
2
)=
0
.
8
. The number of rules in T
(
t
3
)
,Pr
2
>
μ
(
t
3
)
≥
Pr
(
t
3
)
F
(
1; 2
,
0
.
4
)=
3
,Pr
2
0
.
7
×
0
.
84
=
0
.
588
. Since
min
{
Pr
(
t
1
)
,
Pr
(
t
2
)
}
=
0
.
(
t
3
)
≤
Pr
(
t
3
)
F
(
1; 2
,
0
.
3
)=
637
. Therefore, Pr
2
0
.
7
×
0
.
91
=
0
.
(
t
3
)
is bounded in range
[
0
.
588
,
0
.
637
]
.
Since the cumulative probability distribution of the binomial distribution is easier
to calculate than top-
k
probabilities, we propose
PRist+
, a variant of
PRist
using the
binomial distribution bounding technique.
The only difference between
PRist+
and
PRist
is the upper and lower ranks in
the
U
-lists and
L
-lists. In
PRist+
, we compute the upper and lower bounds of the
top-
k
probabilities of tuples using the binomial distributions. Then, the upper and
lower ranks are derived from the upper and lower bounds of the top-
k
probabilities.
Take the
U
-list in prob-interval
b
i
as an example. In
PRist
, an entry in the
U
-
list consists of the tuple id
t
and the upper rank
t
U
i
, such that
Pr
t
.
U
i
i
h
.
(
t
)
>
(if
Pr
m
i
h
). Once the top-
i
probabilities of all ranks 1
(
t
)
>
≤
i
≤
m
for
t
are computed,
t
.
U
i
can be obtained by one scan.
In
PRist+
, we store the upper rank
t
.
U
i
as the smallest rank
x
such that the lower
bound of
Pr
x
i
h
. Following with Theorem 5.16, the upper rank can
(
)
t
is greater than
,
N
)+
F
−
1
i
F
−
1
i
be calculated by
x
1, where
F
−
1
is the binomial inverse cumulative distribution function. The lower rank of
t
can be obtained similarly using Theorem 5.16.
Computing the upper and lower ranks for
t
in
b
i
requires
O
=
(
)
,
N
1or
x
=
(
)
,
N
,
p
max
)+
h
·
Pr
(
t
h
·
Pr
(
t
time. Thus, the
overall complexity of computing the upper and lower ranks of all tuples in all prob-
intervals is
O
(
1
)
(
2
hn
)
, where
n
is the total number of tuples. The complexity of sorting
the bound lists is
O
(
2
hn
log
n
)
. The overall time complexity of constructing a
PRist+
(
+
)=
(
)
index is
O
.
Clearly, the construction time of
PRist+
is much lower than
PRist
. The tradeoff
is that the rank bounds in
PRist+
is looser than
PRist
. As will be shown in the next
section, all query answering methods on
PRist
can be applied on
PRist+
. The looser
rank bounds in
PRist+
does not affect the accuracy of the answers. They only make
a very minor difference in query answering time in our experiments.
2
hn
2
hn
log
n
O
hn
log
n