are prohibitively computationally expensive and time-consuming. Naïve string-matching algorithms likewise fail to meet the demand for in-memory, real-time algorithms for the data redundancy removal (DRR) problem.
Most duplicate detection approaches use the Bloom filter structure [8,9]. The literature hosts a large number of Bloom filter variants, such as the counting [16], compressed [31], space-code, and spectral Bloom filters, to name a few. Sliding-window models [30] and Bloom filter arrays have also been proposed, as have memory-efficient de-duplication algorithms and their parallel versions [20]. The inherent limitations of these existing procedures when handling hundreds of petabytes of data, and in streaming scenarios, have been studied, leading to the development of innovative new approaches to the massive or streaming de-duplication problem.
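To ground the discussion, the following is a minimal sketch of how a plain Bloom filter is typically used for duplicate detection; it is illustrative only, and the array size m, hash count k, and double-hashing scheme are choices made here, not details of any cited variant:

```python
import hashlib

class BloomDeduplicator:
    """Minimal Bloom-filter de-duplicator: no false negatives,
    but false positives accumulate as the bit array fills."""

    def __init__(self, m=1 << 20, k=4):
        self.m = m                      # number of bits
        self.k = k                      # number of hash probes per item
        self.bits = bytearray(m // 8 + 1)

    def _positions(self, item):
        # Derive k probe positions from one digest (double hashing).
        dg = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(dg[:8], "big")
        h2 = int.from_bytes(dg[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen(self, item):
        """Return True if item looks like a duplicate, then record it."""
        pos = self._positions(item)
        duplicate = all(self.bits[p >> 3] & (1 << (p & 7)) for p in pos)
        for p in pos:
            self.bits[p >> 3] |= 1 << (p & 7)
        return duplicate
```

Note that this basic structure never forgets: as the bit array saturates on a long stream, nearly every element is eventually reported as a duplicate, which is exactly the weakness the stable variants discussed later address.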
This chapter delves into solutions for approximate duplicate detection, wherein the application can tolerate a small fraction of false-positive and false-negative results. A false positive (FP) occurs when a distinct element is wrongly reported as a duplicate. The opposite scenario, a false negative (FN), occurs when a duplicate element is falsely reported as distinct. Since the in-memory storage of all elements of a possibly unbounded stream is infeasible, such errors cannot be avoided entirely; de-duplication algorithms therefore aim to minimize the FP and FN rates.
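To make these definitions concrete, the helper below tallies FP and FN events for an approximate de-duplicator against exact ground truth. The detector interface (a seen() method that both reports and records) and the rate denominators are assumptions chosen for illustration; conventions vary across papers:

```python
def error_rates(detector, stream):
    """Measure FPR and FNR of an approximate de-duplicator.

    Here FPR = FPs / number of truly distinct elements and
    FNR = FNs / number of true duplicates (one common convention).
    """
    exact, fp, fn, distinct, dups = set(), 0, 0, 0, 0
    for item in stream:
        truly_dup = item in exact
        reported_dup = detector.seen(item)
        if truly_dup:
            dups += 1
            if not reported_dup:
                fn += 1          # duplicate reported as distinct
        else:
            distinct += 1
            if reported_dup:
                fp += 1          # distinct reported as duplicate
        exact.add(item)
    return (fp / distinct if distinct else 0.0,
            fn / dups if dups else 0.0)
```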
This chapter introduces in detail the working principles of the different variants of de-duplication techniques, along with their merits and disadvantages. Most approaches suffer from high FP and FN rates or are ill-suited for the applications at hand. However, two recent Bloom filter-based algorithms, the stable Bloom filter (SBF) [12] and the reservoir sampling-based stable Bloom filter (RSBF) [15], guarantee a novel characteristic, stability, while providing high throughput; both are discussed in this chapter. SBF and RSBF currently offer the best performance guarantees for duplicate detection. Coupled with the stability property, they far outperform other existing methods in this domain. These structures and algorithms demonstrate that their FP and FN rates become constant as the stream length increases, thereby making them extremely attractive for de-duplication applications. Hence, detailed insights into the working principles of these state-of-the-art algorithms are presented to the reader. A complete theoretical analysis of the FPR, FNR, and stability of these two approaches, along with their experimental comparison on different datasets, is also presented here.
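As a preview of Section 13.3, the following is a minimal sketch of the SBF's per-element update, following the scheme of [12]: m cells of d bits each are maintained, an arriving element first probes K cells (all nonzero means "duplicate"), then P randomly chosen cells are decremented and the K probed cells are reset to the maximum value. The hashing details and default parameter values here are illustrative assumptions:

```python
import hashlib, random

class StableBloomFilter:
    """Sketch of the stable Bloom filter (SBF) of [12]: m cells of
    d bits each. Random decrements make stale elements fade out, so
    the FP rate stabilizes at the cost of possible false negatives."""

    def __init__(self, m=1_000_000, k=3, d=3, p=10, seed=0):
        self.m, self.k, self.p = m, k, p
        self.max = (1 << d) - 1         # cell capacity, e.g., 7 for d=3
        self.cells = [0] * m
        self.rng = random.Random(seed)

    def _positions(self, item):
        dg = hashlib.sha256(item.encode()).digest()
        h1 = int.from_bytes(dg[:8], "big")
        h2 = int.from_bytes(dg[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen(self, item):
        pos = self._positions(item)
        # All probed cells nonzero => report as duplicate.
        duplicate = all(self.cells[q] > 0 for q in pos)
        # Evict stale information: decrement P random cells.
        for _ in range(self.p):
            q = self.rng.randrange(self.m)
            if self.cells[q] > 0:
                self.cells[q] -= 1
        # Record the current element at full strength.
        for q in pos:
            self.cells[q] = self.max
        return duplicate
```

The random decrement step is what yields stability, since old entries are continually evicted and the filter never saturates, but a still-live duplicate can be evicted too, which is the source of the FN events discussed above.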
This chapter aims to establish an exhaustive and detailed reference for the current state of the de-duplication problem, and to highlight future roadmaps and challenges in de-duplication for petabytes of data. In particular, it presents to the reader RSBF as the state-of-the-art data structure for approximate duplicate detection, combining attractive features such as low error rates, theoretical guarantees, stability, and a low memory requirement. It is in this context that SBF has been selected for direct comparison.
The next section describes the various existing de-duplication methods and data structures. Section 13.3 introduces the working concepts of the stable Bloom filter and its theoretical properties, including stability. Section 13.4 then briefly discusses the method of reservoir sampling. Section 13.5 provides a detailed algorithmic overview and theoretical bounds for the reservoir sampling-based stable Bloom filter, which offers the best error rates. Empirical results comparing the performance of the two approaches on different datasets are also presented.
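Since reservoir sampling underpins the RSBF, a minimal sketch of the classic technique (Algorithm R) is given below as a primer for Section 13.4; RSBF's specific use of it is covered there, and this generic version is only for orientation:

```python
import random

def reservoir_sample(stream, s, rng=None):
    """Classic Algorithm R: keep a uniform random sample of s items
    from a stream of unknown length using O(s) memory."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream, start=1):
        if i <= s:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i)        # uniform in [0, i)
            if j < s:                   # i-th item kept w.p. s / i
                reservoir[j] = item
    return reservoir
```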