Advanced Algorithms for Efficient Approximate Duplicate Detection in Data Streams Using Bloom Filters - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

13.2.2 b loom F filters

Bloom filters [8,9] are space-efficient probabilistic bit-vector data structures pro-

viding fast membership queries on sets [8]. A Bloom filter consists of a bit vector

and k hash functions. An element is hashed onto k bit positions of the Bloom filter

using the hash functions. Typical Bloom filter approaches involve k comparisons for

each record for checking the corresponding bit positions of the Bloom filter array.

However, the efficiency of Bloom filters come at the cost of a small false-positive

rate, wherein the Bloom filter falsely reports the presence of the query element. This

occurs due to hash collision of multiple elements onto a single bit position of the

Bloom filter. Interestingly false negatives are never encountered. The probability of

a false positive for a standard Bloom filter is given by [9]: FPR ≈ (l − e − kn / m ) k , where

m is the number of elements in the stream and n is the number of distinct elements

observed. Given n and m , the optimal number of hash functions k = ln 2( m / n ). Bloom

filters were first used by TAPER system [26] depicting ease of implementation and

fast performance.

13.2.3 C ounting anD w inDow b loom F filters

Counting Bloom filters [16] were introduced to support the scenario where the con-

tents of a set change over time due to insertions and deletions. In this approach the

bits were replaced by small counters that stored the number of elements hashed onto

the particular bit position, and were updated with insertion and deletion of elements.

However, the support for deletion operations from the structure now gives rise to

false negatives. To meet the needs of varied application scenarios, a large number

of Bloom filter variants were proposed such as the compressed Bloom filter [31],

space-code Bloom filter [27], and spectral Bloom filter [35] to name but a few. Even

window model of Bloom filters were proposed [30] such as landmark window, jump-

ing window, sliding window [36], etc. These models operate on a definite amount

of history of objects observed in the stream to draw conclusions for processing of

future elements.

Bloom filters have even been applied to network related applications such as find-

ing heavy flows for stochastically fair blue queue management [17], packet classi-

fication [6], per-flow state management, and longest prefix matching [13]. Multiple

Bloom filters in conjunction with hash tables have been studied to represent items

with multiple attributes accurately and efficiently with low false-positive rates [25].

Bloomjoin used for distributed joins have also been extended to minimize network

usage for query execution based on database statistics. Bloom filters have also been

used for speeding up name-to-location resolution processes [29].

13.2.4 P arallel b loom F filters

To cater to enormous data in the order of petabytes and multiple input stream sce-

narios, parallel variants of Bloom filters have been explored [20]. Adhering to the

real-time needs of de-duplication applications, even disk based Bloom filters exist

in the literature. The advent of MapReduce and other techniques for parallelization

Search WWH ::

Custom Search

Home