i.e., Pr^k(t) = E[X_t]. Our objective is to draw a set S of sample possible worlds, and compute the mean of X_t in S, namely E_S[X_t], as the approximation of E[X_t].
We use uniform sampling with replacement. For table T = {t_1, ..., t_n} and the set of generation rules R, a sample unit (i.e., an observation) is a possible world. We generate the sample units under the distribution of T: to pick a sample unit s, we scan T once. An independent tuple t_i is included in s with probability Pr(t_i). For a multi-tuple generation rule R: t_{r_1} ⊕ ... ⊕ t_{r_m}, s includes one tuple involved in R with probability Pr(R). If s takes a tuple in R, then tuple t_{r_l} (1 ≤ l ≤ m) is chosen with probability Pr(t_{r_l}) / Pr(R). Thus, s contains at most one tuple from any generation rule.
Once a sample unit s is generated, we compute the top- k tuples in s . For each
tuple t in the top- k list, X t =
1. The indicators for other tuples are set to 0.
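As a concrete illustration, drawing one sample unit and computing its indicators can be sketched as follows. This is a minimal Python sketch, not the authors' implementation: tuples are identified by ids, membership probabilities are passed as dicts, a rule's probability Pr(R) is taken to be the sum of its members' probabilities, and all names are hypothetical.

```python
import random

def draw_sample_unit(independent, rules, score, k):
    """Draw one possible world (a sample unit) and compute indicators.

    independent: dict mapping tuple id -> membership probability Pr(t).
    rules: list of multi-tuple generation rules, each a dict mapping
           tuple id -> Pr(t); Pr(R) is the sum over the rule's members,
           and at most one member can appear in a possible world.
    score: dict mapping tuple id -> ranking score (higher ranks first).
    Returns (indicators, topk): X_t = 1 for the top-k tuples of the
    unit, 0 for the other tuples appearing in it (absent tuples are
    implicitly 0).
    """
    world = []
    # An independent tuple t_i enters s with probability Pr(t_i).
    for t, p in independent.items():
        if random.random() < p:
            world.append(t)
    # A rule R contributes one tuple with probability Pr(R); member
    # t_rl is then chosen with probability Pr(t_rl) / Pr(R).
    for rule in rules:
        pr_rule = sum(rule.values())
        if random.random() < pr_rule:
            u, acc = random.random() * pr_rule, 0.0
            for t, p in rule.items():
                acc += p
                if u < acc:
                    world.append(t)
                    break
    topk = sorted(world, key=score.get, reverse=True)[:k]
    return {t: 1 if t in topk else 0 for t in world}, topk
```

Repeating this draw and averaging the indicator of each tuple over all units yields E_S[X_t].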
The above sample generation process can be repeated so that a sample S is obtained. Then, E_S[X_t] can be used to approximate E[X_t]. When the sample size is large enough, the approximation quality can be guaranteed following from the well-known Chernoff-Hoeffding bound [182].
Theorem 5.8 (Sample size). For any δ ∈ (0, 1), ε > 0, and a sample S of possible worlds, if |S| ≥ 3 ln(2/δ) / ε², then for any tuple t, Pr{ |E_S[X_t] − E[X_t]| > ε · E[X_t] } ≤ δ.
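The bound of Theorem 5.8 translates directly into a small sample-size calculator (the function name is ours):

```python
import math

def sample_size(eps, delta):
    """Smallest |S| satisfying Theorem 5.8:
    |S| >= 3 ln(2/delta) / eps^2 guarantees that, for every tuple t,
    Pr{|E_S[X_t] - E[X_t]| > eps * E[X_t]} <= delta."""
    return math.ceil(3 * math.log(2 / delta) / eps ** 2)
```

For instance, a 10% relative error (ε = 0.1) with 95% confidence (δ = 0.05) requires at least 1,107 sample units.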
We can implement the sampling method efficiently using the following two tech-
niques, as verified by our experiments.
First, we can sort all tuples in T in the ranking order into a sorted list L . The first
k tuples in a sample unit are the top- k answers in the unit. Thus, when generating a
sample unit, instead of scanning the whole table T , we only need to scan L from the
beginning and generate the tuples in the sample as described before. However, once
the sample unit has k tuples, the generation of this unit can stop. In this way, we
reduce the cost of generating sample units without losing the quality of the sample.
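A minimal sketch of this early-stopping scan, under the simplifying assumption that all tuples are independent and the list L stores membership probabilities in ranking order (all names are ours):

```python
import random

def scan_until_k(sorted_probs, k):
    """Scan the ranked list L from the top, adding each independent
    tuple with its membership probability, and stop as soon as the
    sample unit holds k tuples. Returns the number of tuples scanned."""
    hits = scanned = 0
    for p in sorted_probs:
        scanned += 1
        if random.random() < p:
            hits += 1
            if hits == k:
                break
    return scanned

random.seed(1)
L = [0.5] * 10_000            # |T| = 10,000 independent tuples, mu = 0.5
trials = [scan_until_k(L, 10) for _ in range(5_000)]
avg = sum(trials) / len(trials)
# avg stays close to k / mu = 10 / 0.5 = 20, far below |T| = 10,000
```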
For example, when all tuples are independent, if the average membership probability is μ, the expected number of tuples we need to scan to generate a sample unit is k/μ, which can be much smaller than |T| when k ≪ |T|.
Second, in practice, the actual approximation quality may converge well before the sample size reaches the bound given in Theorem 5.8. Thus, progressive sampling can be adopted: we generate sample units one by one and compute the estimated top-k probability of tuples after each unit is drawn. For given parameters d > 0 and φ > 0, the sampling process stops if, in the last d sample units, the change of the estimated top-k probability of any tuple t is smaller than φ.
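Progressive sampling with the (d, φ) stopping rule might look like the sketch below. It interprets the rule as requiring every tuple's estimate to change by less than φ for d consecutive units; `draw_unit` is a hypothetical callable returning the top-k tuple ids of one freshly drawn sample unit.

```python
def progressive_estimate(draw_unit, d, phi, max_units=10_000):
    """Draw sample units one by one, maintaining the running estimate
    E_S[X_t] = (# units where t is in the top-k) / (# units) for every
    tuple seen so far. Stop once no estimate has changed by phi or
    more for d consecutive units (or at max_units)."""
    counts, prev, est = {}, {}, {}
    stable = n = 0
    while n < max_units:
        n += 1
        for t in draw_unit():          # top-k tuple ids of a fresh unit
            counts[t] = counts.get(t, 0) + 1
        est = {t: c / n for t, c in counts.items()}
        # largest change of any estimate since the previous unit
        change = max(
            (abs(est.get(t, 0.0) - prev.get(t, 0.0))
             for t in est.keys() | prev.keys()),
            default=0.0,
        )
        stable = stable + 1 if change < phi else 0
        if stable >= d:
            break
        prev = est
    return est, n
```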
To answer a PT- k query with probability threshold p , we first run the above sam-
pling algorithm. Then, we scan the tuples in L and output the tuples whose estimated
top- k probabilities are at least p .
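The final filtering step reduces to a threshold scan over the ranked list (a sketch with hypothetical names):

```python
def pt_k_answer(ranked_ids, est_topk_prob, p):
    """Scan the tuples in ranking order (the list L) and output those
    whose estimated top-k probability is at least the threshold p."""
    return [t for t in ranked_ids if est_topk_prob.get(t, 0.0) >= p]
```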
After obtaining estimated top-k probabilities of tuples using the above sampling method, top-(k, l) queries and top-(p, l) queries can be answered similarly as discussed in Section 5.2.1.