Database Reference
In-Depth Information
6.2 A Sampling Method
In this section, we propose a sampling method to estimate the top- k probability of
each stream with a probabilistic quality guarantee.
For a stream O in a sliding window W t
, we are interested in the event that O
is ranked top- k . Let Z O be the indicator to the event: Z O =
( O )
1if O is ranked top- k in
W t
Pr k
( O )
; Z O =
(
Z O =
)=
(
)
0 otherwise. Then, Pr
1
O
.
To approximate the probability Pr
(
Z O
=
)
, we design a statistic experiment as
follows. We draw samples of possible worlds and compute the top- k lists on the
samples. That is, in a sample, for each stream O we select an instance o in window
W t
1
. A sample is a possible world. Then, we sort all instances in the sample in
the ranking order and find the top- k list.
We repeat the experiment m times independently. Let Z O , i be the value of Z O at
the i -th run. Then, E
(
O
)
1
m
m
i = 1 Z O , i is an estimation of E
.
By using a sufficiently large number of samples, we can obtain a good approx-
imation of Pr k
[
Z O ]=
[
Z O ]=
Pr
(
Z O =
1
)
(
)
with high probability. The methods follows the idea of unre-
stricted random sampling (also known as simple random sampling with replace-
ment) [34]. The following minimum sample size can be derived from the well
known Chernoff-Hoffding bound [182].
Z O
Theorem 6.5 (Minimum sample size). For any stream O, let Z O , 1 ,...,
Z O , m be the
values of Z O in m independent experiments, and E
1
m
i
[
Z O ]=
1 Z O , i . For any
δ
=
3ln 2
δ
{| E
( 0
< δ <
1 ),
ξ
(
ξ >
0 ), if m
, then Pr
[
Z O ]
E
[
Z O ] |> ξ }≤ δ
.
ξ
2
For efficient implementation, we maintain an indicator variable for each stream.
To avoid sorting instances repeatedly, we first sort all instances in a sliding window
W t
. When drawing a sample, we scan the sorted list from the beginning, and
select an instance o
( O )
1
ω
if stream O has no instance in a sample
yet. If an instance is chosen, the corresponding stream indicator is set. When the
sample already contains k instances, the scan stops since the instances sampled later
cannot be in the top- k list. The sample can then be discarded since it will not be used
later.
The space complexity of the sampling method is O
O in probability
, because all instances in
the sliding window have to be stored. The time complexity is O
(
n
ω)
(
mn
ω +
n
ω
log
(
n
ω))
for the first window and O
for other windows where m is the
number of samples drawn, since the n new instances can be inserted into the sorted
list in W t
(
mn
ω +
n log
(
n
ω))
to form the sorted list in W t + 1
( O )
( O )
.
6.3 Space Efficient Methods
In the exact algorithm and the sampling algorithm, we need to store the sliding
window in main memory. In some applications, there can be a large number of
streams and the window width is non-trivial. In this section, we develop the space
 
Search WWH ::




Custom Search