the two techniques (SBF and RSBF) over varied data sets, along with the parameter settings, are presented in Section 13.6. Finally, Section 13.7 concludes the chapter, summarizing the findings and citing possible directions for future work.
13.2 DUPLICATE DETECTION APPROACHES AND DATA STRUCTURES
Duplicate detection is a classical problem in data storage and databases, and it has given rise to numerous buffering solutions. With the advent of online arrival of data and transactions, detecting duplicates in such streaming environments using buffering and caching mechanisms [19] constitutes a naïve solution. The inability to store all the data arriving on the stream led to the design of approximate de-duplication methods.
The management of large data streams with limited memory and acceptable error rates, for tasks such as computing approximate frequency moments [4], element classification [23], and correlated aggregate queries [21], has become a spotlight among the research community. Bit shaving, the practice of fraudulent advertisers withholding commission for a certain fraction of the traffic or hits, has also been studied in this context [34]. This prompted the growth of approximate duplicate detection techniques in both databases and web applications. Redundancy removal algorithms for search engines were first studied in [10,11,28].
Formally, the duplicate detection problem is defined as follows: given a data stream S and a fixed memory space M, report whether an element e_i in S has already occurred among e_1, e_2, ..., e_{i-1}. As storing the entire stream, which can be unbounded, is infeasible, an approximate estimate minimizing the error rates is required.
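The exact version of this membership query is straightforward to state in code; the sketch below (illustrative, not from the chapter) makes the infeasibility concrete: the set of seen elements grows with the number of distinct elements, which is unbounded for an unbounded stream.

```python
# Exact duplicate detection: for each element e_i, answer whether it
# occurred among e_1, ..., e_{i-1}. The auxiliary set grows with the
# number of distinct elements, so memory is unbounded on an unbounded
# stream -- exactly the infeasibility that motivates approximate methods.
def detect_duplicates(stream):
    seen = set()
    for e in stream:
        yield e in seen  # True if e is a duplicate of an earlier element
        seen.add(e)

flags = list(detect_duplicates(["a", "b", "a", "c", "b"]))
# flags == [False, False, True, False, True]
```

Approximate schemes trade this unbounded state for a fixed memory M, accepting a small error rate in return.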
13.2.1 Database Query and Buffering Techniques
Straightforward approaches using traditional database queries or pairwise string matching algorithms are prohibitively slow, as they involve disk accesses, defeating the real-time demands of streaming applications. Naïve caching and buffering methods [19] populate a fixed-size buffer with elements arriving on the stream. The membership query for an element is then answered by checking whether the element is present in the buffer. However, once the buffer is completely filled, storing a new element requires evicting an element already in the buffer. Several such eviction or replacement policies have been studied in [19]. Nevertheless, the performance of these methods depends heavily on the replacement policy adopted.
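A minimal sketch of such a buffered detector, assuming a least-recently-used (LRU) eviction policy (one of many possible choices, not necessarily the one favored in [19]):

```python
from collections import OrderedDict

# Buffered duplicate detection with a fixed-capacity buffer and LRU
# eviction. Once an element is evicted it is forgotten, so a later
# re-arrival is misreported as new -- which is why accuracy hinges on
# the replacement policy.
class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = OrderedDict()  # keys ordered from least to most recent

    def seen(self, e):
        """Answer the membership query for e, then record e's arrival."""
        if e in self.buf:
            self.buf.move_to_end(e)  # refresh recency on a hit
            return True
        if len(self.buf) >= self.capacity:
            self.buf.popitem(last=False)  # evict least recently used
        self.buf[e] = None
        return False
```

For example, with capacity 2, the arrivals x, y, x, z, y yield False, False, True, False, False: the final y is a false negative because it was evicted to make room for z.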
Fuzzy algorithms were also proposed in [7,39] to study the performance of approximate duplicate detection mechanisms. Bit shaving in the context of fraudulent advertiser traffic was studied in [34], and duplicate detection for web applications was presented in [10,11,28]. File-level hashing was used for de-duplication in storage systems [1,14,37], but it provided a low compression ratio. Secure hashes were even proposed for fixed-sized data blocks [33].
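The file-level idea can be sketched as follows (an illustrative example, not the scheme of [1,14,37]): each file is keyed by a secure hash of its entire content, so identical files are stored only once. Because the granularity is the whole file, two files differing by a single byte share no storage, which explains the low compression ratio noted above.

```python
import hashlib

# File-level hashing for de-duplication: a file (modeled as a byte
# string) is stored under the SHA-256 digest of its full content, so
# byte-identical files collapse to one stored copy. Any change, however
# small, produces a new digest and a whole new copy.
store = {}

def store_file(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)  # keep only one copy per digest
    return digest

k1 = store_file(b"hello world")
k2 = store_file(b"hello world")   # exact duplicate: nothing new stored
k3 = store_file(b"hello world!")  # one byte differs: a full new copy
```

Here k1 == k2 while k3 differs, and the store holds two copies despite the three files sharing almost all their content.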