Database Reference
In-Depth Information
the best setting for the number of Bloom filters, k and the size of each Bloom filter, s
such that sk = M , is to be found.
SBF in this case establishes a constraint on p depending on a threshold FPR given
as input by the application, and then computes the optimal value of k by varying the
values of Max. Also, the nature of the stream and the expected number of FNs are
used to compute the best value of Max. In case, no such prior data is available, Max
is set to 1.
RSBF, on the other hand, accepts the threshold FPR, FPR t as input, and computes
the optimal value of k and s for overall low FPR and FNR. Assume that the algorithm
conforms to the threshold FPR, FPR t after the initial s elements of the stream has
been processed. An FPR will occur for an element e s +1 if all the corresponding bits
in the Bloom filter for e s +1 are set. Considering a single Bloom filter, the particular
bit into which e s +1 hashes to will be set if at least one of the s elements maps into it.
Hence, by similar computations
k
1
=
1
FPR
t
e
(13.17)
ln(
FPR
)
M
k
t
∴=
k
,and
s
=
1
ln
1
e
FPR is found to decrease with increase in the value of k , while FNR is the lowest
when k = 1. Hence, to optimize this trade-offs, the value of k is set as the arithmetic
mean of 1 and that obtained in Equation 13.17. Given the value of k , s is thus appro-
priately set accordingly.
13.6.2 r esults anD a nalysis
This section provides to the readers the performance comparisons of both RSBF
and SBF [12] algorithms on real as well as synthetic data sets. The real data set
containing clickstream data* having around 3 million records and random data set
with 1 billion records were used to evaluate the quality of membership query results
generated. The clickstream data set contained a user's click history over a period.
Simulating the fingerprints of URL clicks by users, the synthetic data set was uni-
formly randomly generated.
The two sets of experiments shown captures (a) the variation of FNR, FPR, and
convergence with increasing number of records in the input and (b) the variation of
FNR and FPR with increasing amounts of memory for sampling the input stream,
using multiple data sets for increasing percentage of duplicates. In all the experi-
ments, p * was set to 0.03. For faster changing streams or more biased reservoir sam-
pling method, p * is set to a higher value.
* Obtained from http://www.sigkdd.org/kddcup/index.php?section=2000&method=data.
Search WWH ::




Custom Search