Database Reference
In-Depth Information
A memory-less temporal bias functions for streams for evolving streams have
been proposed in [3]. Apart from O (1) processing time per stream element, incorpo-
rating the bias results in upper bounds of reservoir sizes limiting the maximum space
requirement to nearly constant in most cases even for an infinitely long data stream.
The next section presents biased reservoir sampling techniques on the Bloom filters
for the de-duplication problem.
13.5 RESERVOIR SAMPLING-BASED BLOOM
FILTER (RSBF) APPROACH
This section discusses the design and working model of the reservoir sampling-
based stable Bloom filter (RSBF) for de-duplication in large data streams. RSBF
intelligently combines the concepts of reservoir sampling techniques [15] and that
of Bloom filter approach. RSBF provides the first of such an integration to exhibit
enhanced performance for de-duplication.
RSBF comprises k Bloom filters, each of size s bits and are initially set to 0. On
arrival of a new element, e it is hashed to one of the s bits in each of the k Bloom
filters with the help of k different uniform random hash functions. The existence of
the element is verified by checking whether these k bit positions are set. If all the k
bit positions are set to 1, then RSBF reports the element to be duplicate, else to be
distinct. RSBF directly inserts the initial s elements of the stream into the structure
by setting the corresponding k bit positions in the Bloom filter. Each element e i , for
i  > s , is then first probed against the Bloom filter structure to determine the duplicate
or distinct status. If e i is reported as distinct, it is inserted in the structure with prob-
ability p i = s / i (insert probability), where i is the current length of the stream and s is
the size of each of the Bloom filter.
However, with the increase in the number of bits set in the Bloom filters, RSBF
would suffer from a high rate of false positives wherein a distinct element is falsely
reported as duplicate. As the length of the stream increases, it can be observed that
the probability of an element being a duplicate increases (since the elements are
drawn from a finite universe). The reservoir sampling method implicitly helps to pre-
vent such a scenario by increasingly rejecting elements from being inserted into the
structure (as the insert probability decreases). Insertion of elements from a possibly
infinite stream would inevitable lead to the setting of nearly all the bits of RSBF to 1,
thereby incurring a high false-positive rate (FPR). To alleviate this problem, when-
ever an element is inserted into RSBF, the algorithm also deletes k randomly uni-
formly chosen bit (one from each Bloom filter) by setting it to 0. It should be observed
that such deletion operation invariably leads to the presence of false negatives.
Applications involving duplicate detection demand low tolerance for both false
positive as well as false-negative rates (FNR). The use of reservoir sampling helps
to keep the false-positive rate significantly lower. However, the repeated rejection
of elements (possibly distinct) with increase in the stream length may result in an
increase of the FNR, thereby degrading the performance of RSBF. To address this
problem, RSBF uses a weak form of biasing on the reservoir sampling operation per-
formed on the stream elements. When the insert probability of an element decreases
beyond a specified threshold, p* , and is reported as distinct by probing its bits, the
Search WWH ::




Custom Search