Database Reference
In-Depth Information
13.2.2 b loom F filters
Bloom filters [8,9] are space-efficient probabilistic bit-vector data structures pro-
viding fast membership queries on sets [8]. A Bloom filter consists of a bit vector
and k hash functions. An element is hashed onto k bit positions of the Bloom filter
using the hash functions. Typical Bloom filter approaches involve k comparisons for
each record for checking the corresponding bit positions of the Bloom filter array.
However, the efficiency of Bloom filters come at the cost of a small false-positive
rate, wherein the Bloom filter falsely reports the presence of the query element. This
occurs due to hash collision of multiple elements onto a single bit position of the
Bloom filter. Interestingly false negatives are never encountered. The probability of
a false positive for a standard Bloom filter is given by [9]: FPR ≈ (l − e kn / m ) k , where
m is the number of elements in the stream and n is the number of distinct elements
observed. Given n and m , the optimal number of hash functions k = ln 2( m / n ). Bloom
filters were first used by TAPER system [26] depicting ease of implementation and
fast performance.
13.2.3 C ounting anD w inDow b loom F filters
Counting Bloom filters [16] were introduced to support the scenario where the con-
tents of a set change over time due to insertions and deletions. In this approach the
bits were replaced by small counters that stored the number of elements hashed onto
the particular bit position, and were updated with insertion and deletion of elements.
However, the support for deletion operations from the structure now gives rise to
false negatives. To meet the needs of varied application scenarios, a large number
of Bloom filter variants were proposed such as the compressed Bloom filter [31],
space-code Bloom filter [27], and spectral Bloom filter [35] to name but a few. Even
window model of Bloom filters were proposed [30] such as landmark window, jump-
ing window, sliding window [36], etc. These models operate on a definite amount
of history of objects observed in the stream to draw conclusions for processing of
future elements.
Bloom filters have even been applied to network related applications such as find-
ing heavy flows for stochastically fair blue queue management [17], packet classi-
fication [6], per-flow state management, and longest prefix matching [13]. Multiple
Bloom filters in conjunction with hash tables have been studied to represent items
with multiple attributes accurately and efficiently with low false-positive rates [25].
Bloomjoin used for distributed joins have also been extended to minimize network
usage for query execution based on database statistics. Bloom filters have also been
used for speeding up name-to-location resolution processes [29].
13.2.4 P arallel b loom F filters
To cater to enormous data in the order of petabytes and multiple input stream sce-
narios, parallel variants of Bloom filters have been explored [20]. Adhering to the
real-time needs of de-duplication applications, even disk based Bloom filters exist
in the literature. The advent of MapReduce and other techniques for parallelization
Search WWH ::




Custom Search