Advanced Algorithms for Efficient Approximate Duplicate Detection in Data Streams Using Bloom Filters - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

A memory-less temporal bias functions for streams for evolving streams have

been proposed in [3]. Apart from O (1) processing time per stream element, incorpo-

rating the bias results in upper bounds of reservoir sizes limiting the maximum space

requirement to nearly constant in most cases even for an infinitely long data stream.

The next section presents biased reservoir sampling techniques on the Bloom filters

for the de-duplication problem.

13.5 RESERVOIR SAMPLING-BASED BLOOM

FILTER (RSBF) APPROACH

This section discusses the design and working model of the reservoir sampling-

based stable Bloom filter (RSBF) for de-duplication in large data streams. RSBF

intelligently combines the concepts of reservoir sampling techniques [15] and that

of Bloom filter approach. RSBF provides the first of such an integration to exhibit

enhanced performance for de-duplication.

RSBF comprises k Bloom filters, each of size s bits and are initially set to 0. On

arrival of a new element, e it is hashed to one of the s bits in each of the k Bloom

filters with the help of k different uniform random hash functions. The existence of

the element is verified by checking whether these k bit positions are set. If all the k

bit positions are set to 1, then RSBF reports the element to be duplicate, else to be

distinct. RSBF directly inserts the initial s elements of the stream into the structure

by setting the corresponding k bit positions in the Bloom filter. Each element e i , for

i > s , is then first probed against the Bloom filter structure to determine the duplicate

or distinct status. If e i is reported as distinct, it is inserted in the structure with prob-

ability p i = s / i (insert probability), where i is the current length of the stream and s is

the size of each of the Bloom filter.

However, with the increase in the number of bits set in the Bloom filters, RSBF

would suffer from a high rate of false positives wherein a distinct element is falsely

reported as duplicate. As the length of the stream increases, it can be observed that

the probability of an element being a duplicate increases (since the elements are

drawn from a finite universe). The reservoir sampling method implicitly helps to pre-

vent such a scenario by increasingly rejecting elements from being inserted into the

structure (as the insert probability decreases). Insertion of elements from a possibly

infinite stream would inevitable lead to the setting of nearly all the bits of RSBF to 1,

thereby incurring a high false-positive rate (FPR). To alleviate this problem, when-

ever an element is inserted into RSBF, the algorithm also deletes k randomly uni-

formly chosen bit (one from each Bloom filter) by setting it to 0. It should be observed

that such deletion operation invariably leads to the presence of false negatives.

Applications involving duplicate detection demand low tolerance for both false

positive as well as false-negative rates (FNR). The use of reservoir sampling helps

to keep the false-positive rate significantly lower. However, the repeated rejection

of elements (possibly distinct) with increase in the stream length may result in an

increase of the FNR, thereby degrading the performance of RSBF. To address this

problem, RSBF uses a weak form of biasing on the reservoir sampling operation per-

formed on the stream elements. When the insert probability of an element decreases

beyond a specified threshold, p* , and is reported as distinct by probing its bits, the

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home