are prohibitively computationally expensive and time-consuming. Naïve string-matching algorithms likewise fail to meet the demand for in-memory, real-time algorithms for the data redundancy removal (DRR) problem.
Most duplicate detection approaches use the Bloom filter structure [8,9]. The literature hosts a large number of Bloom filter variants, such as the counting [16], compressed [31], space-code, and spectral Bloom filters, to name a few. Sliding-window models [30] and Bloom filter arrays have also been proposed, as have memory-efficient de-duplication algorithms and their parallel versions [20]. The inherent limitations of these existing procedures when handling hundreds of petabytes of data, and in streaming scenarios, have been studied, leading to the development of innovative new approaches to the massive or streaming de-duplication problem.
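To ground the discussion, the following is a minimal sketch of how a plain Bloom filter is typically used for duplicate detection; it is illustrative only, and the array size m, hash count k, and double-hashing scheme are choices made here, not details of any cited variant:

```python
import hashlib

class BloomDeduplicator:
    """Minimal Bloom-filter de-duplicator: no false negatives,
    but false positives accumulate as the bit array fills."""

    def __init__(self, m=1 << 20, k=4):
        self.m = m                      # number of bits
        self.k = k                      # number of hash probes per item
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        # Derive k probe positions from one digest (double hashing).
        dg = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(dg[:8], "big")
        h2 = int.from_bytes(dg[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen(self, item):
        """Return True if item looks like a duplicate, then record it."""
        pos = self._positions(item)
        duplicate = all(self.bits[p >> 3] & (1 << (p & 7)) for p in pos)
        for p in pos:
            self.bits[p >> 3] |= 1 << (p & 7)
        return duplicate
```

Note that this basic structure never forgets: as the bit array saturates on a long stream, nearly every element is eventually reported as a duplicate, which is exactly the weakness the stable variants discussed later address.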
This chapter delves into solutions for approximate duplicate detection, wherein the application can tolerate a small fraction of false-positive and false-negative results. A false positive (FP) occurs when a distinct element is wrongly reported as a duplicate. The opposite scenario, a false negative (FN), occurs when a duplicate element is falsely reported as distinct. Since the in-memory storage of all elements of a possibly unbounded stream is infeasible, such errors cannot be avoided entirely; de-duplication algorithms therefore aim to minimize the FP and FN rates.
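To make these definitions concrete, the helper below tallies FP and FN events for an approximate de-duplicator against exact ground truth. The detector interface (a seen() method that both reports and records) and the rate denominators are assumptions chosen for illustration; conventions vary across papers:

```python
def error_rates(detector, stream):
    """Measure FPR and FNR of an approximate de-duplicator.

    Here FPR = FPs / number of truly distinct elements and
    FNR = FNs / number of true duplicates (one common convention).
    """
    exact, fp, fn, distinct, dups = set(), 0, 0, 0, 0
    for item in stream:
        truly_dup = item in exact
        reported_dup = detector.seen(item)
        if truly_dup:
            dups += 1
            if not reported_dup:
                fn += 1          # duplicate reported as distinct
        else:
            distinct += 1
            if reported_dup:
                fp += 1          # distinct reported as duplicate
        exact.add(item)
    return (fp / distinct if distinct else 0.0,
            fn / dups if dups else 0.0)
```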
This chapter introduces in detail the working principles of the different variants of de-duplication techniques, along with their merits and disadvantages. Most approaches suffer from high FP and FN rates or are ill-suited for the applications at hand. However, two recent Bloom filter-based algorithms, the stable Bloom filter (SBF) [12] and the reservoir sampling-based stable Bloom filter (RSBF) [15], guarantee a novel characteristic, stability, while providing high throughput; both are discussed in this chapter. SBF and RSBF currently offer the best performance guarantees for duplicate detection. Coupled with the stability property, they far outperform other existing methods in this domain. These structures and algorithms demonstrate that their FP and FN rates become constant as the stream length increases, thereby making them extremely attractive for de-duplication applications. Hence, detailed insights into the working principles of these state-of-the-art algorithms are presented to the reader. A complete theoretical analysis of the FPR, FNR, and stability of these two approaches, along with their experimental comparison on different datasets, is also presented here.
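As a preview of Section 13.3, the following is a minimal sketch of the SBF's per-element update, following the scheme of [12]: m cells of d bits each are maintained, an arriving element first probes K cells (all nonzero means "duplicate"), then P randomly chosen cells are decremented and the K probed cells are reset to the maximum value. The hashing details and default parameter values here are illustrative assumptions:

```python
import hashlib, random

class StableBloomFilter:
    """Sketch of the stable Bloom filter (SBF) of [12]: m cells of
    d bits each. Random decrements make stale elements fade out, so
    the FP rate stabilizes at the cost of possible false negatives."""

    def __init__(self, m=1_000_000, k=3, d=3, p=10, seed=0):
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1         # cell capacity, e.g., 7 for d=3
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def _positions(self, item):
        dg = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(dg[:8], "big")
        h2 = int.from_bytes(dg[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen(self, item):
        pos = self._positions(item)
        # All probed cells nonzero => report as duplicate.
        duplicate = all(self.cells[q] > 0 for q in pos)
        # Evict stale information: decrement P random cells.
        for _ in range(self.p):
            q = self.rng.randrange(self.m)
            if self.cells[q] > 0:
                self.cells[q] -= 1
        # Record the current element at full strength.
        for q in pos:
            self.cells[q] = self.max
        return duplicate
```

The random decrement step is what yields stability, since old entries are continually evicted and the filter never saturates, but a still-live duplicate can be evicted too, which is the source of the FN events discussed above.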
This chapter aims to establish an exhaustive and detailed reference for the current state of the de-duplication problem, and to highlight future roadmaps and challenges in de-duplication for petabytes of data. In particular, it presents to the reader RSBF as the state-of-the-art data structure for approximate duplicate detection, combining attractive features such as low error rates, theoretical guarantees, stability, and a low memory requirement. It is in this context that SBF has been selected for direct comparison.
The next section describes the various existing de-duplication methods and data structures. Section 13.3 introduces the working concepts of the stable Bloom filter and its theoretical properties, including stability. Section 13.4 then briefly discusses the method of reservoir sampling. Section 13.5 provides a detailed algorithmic overview and theoretical bounds for the reservoir sampling-based stable Bloom filter, which offers the best error rates. Empirical results comparing the performance of the two approaches on different datasets are also presented.
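Since reservoir sampling underpins the RSBF, a minimal sketch of the classic technique (Algorithm R) is given below as a primer for Section 13.4; RSBF's specific use of it is covered there, and this generic version is only for orientation:

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Classic Algorithm R: keep a uniform random sample of s items
    from a stream of unknown length using O(s) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i)        # uniform in [0, i)
            if j < s:                   # i-th item kept w.p. s / i
                reservoir[j] = item
    return reservoir
```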