13.1 INTRODUCTION AND MOTIVATION
Unprecedented technological advances in diverse fields have led to an explosion in the amount of data stored worldwide. With the volume of information breaking the petabyte barrier, efficient management, indexing, retrieval, and processing have become a central research focus in most data-intensive applications. Myriad sources such as telecommunication call data records, telescopic imagery, online transactions, web pages, stock markets, climate warning systems, and medical records demand resource- and compute-efficient handling of such rapidly growing data. Removing redundancy from such multibillion-record data sets constitutes an important area of study.
"Intelligent compression," or data de-duplication in streaming scenarios, calls for precise identification and elimination of duplicates on the fly from unbounded data streams so that downstream processing remains computationally efficient. The real-time nature of such applications, which require processing capacities greater than 1 GB/s, poses an even greater challenge in this domain. This chapter highlights the problem of detecting and eliminating redundant records arriving in large streaming data sets and provides readers with a detailed description of state-of-the-art solutions to this ever-growing problem.
Consider a national telecommunication network that generates call data records (CDRs) and stores important information such as the callee number, caller number, call duration, etc., for further analytics. Redundant or duplicate records may, however, be generated due to errors in this procedure. Storing billions of such CDRs in real time in a central data repository calls for duplicate detection and removal to enhance performance. Typical approaches that use database queries or Bloom filter [20] methods are either prohibitively slow due to disk access, defeating the real-time requirements of most applications, or extremely resource intensive, requiring around 20 GB to store the entire data set (~6 billion CDRs) for exact matching. Even disk-based algorithms [20] carry a heavy performance penalty. Hence, there is a paramount need for de-duplication algorithms that operate in memory, deliver real-time performance, and keep false-positive (FP) and false-negative (FN) rates at tolerable levels.
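To make the in-memory alternative concrete, the following is a minimal sketch, in Python, of a plain Bloom filter used as a streaming duplicate check. It illustrates the general technique rather than the specific method of [20]; the class name, the standard sizing formulas, and the example CDR keys are assumptions made for this example. A plain Bloom filter never misses a key it has already recorded (no false negatives) but may occasionally report a false positive, and it stores only a few bits per expected key rather than the full records.

import hashlib
import math


class BloomFilter:
    """Minimal in-memory Bloom filter for approximate duplicate detection.

    Sized for an expected number of keys n and a target false-positive
    rate p using the standard formulas m = -n ln p / (ln 2)^2 bits and
    k = (m / n) ln 2 hash functions.
    """

    def __init__(self, capacity, fp_rate=0.01):
        self.m = int(-capacity * math.log(fp_rate) / (math.log(2) ** 2))  # number of bits
        self.k = max(1, round((self.m / capacity) * math.log(2)))         # number of hashes
        self.bits = bytearray((self.m + 7) // 8)                          # zero-initialized bit array

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen_or_add(self, item):
        """Return True if item was (probably) seen before; otherwise record it."""
        positions = self._positions(item)
        seen = all(self.bits[p >> 3] & (1 << (p & 7)) for p in positions)
        if not seen:
            for p in positions:
                self.bits[p >> 3] |= 1 << (p & 7)
        return seen


# Usage: drop duplicates on the fly; the CDR key format below is hypothetical.
dedup = BloomFilter(capacity=1_000_000, fp_rate=0.001)
stream = [
    "9876543210|1234567890|2021-01-01T10:00:00|120",
    "9876543210|1234567890|2021-01-01T10:00:00|120",  # duplicate record
]
for cdr_key in stream:
    if dedup.seen_or_add(cdr_key):
        continue  # duplicate (or rare false positive): do not store this record
    # otherwise, forward the record to the central repository

With a capacity of one million keys and a 0.1% target FP rate, the filter above occupies roughly 1.8 MB, which illustrates why approximate in-memory structures are attractive compared with a ~20 GB exact-match store.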
The growth of search engines is another area where de-duplication algorithms are important. Search engines must regularly crawl the web to extract new URLs and update their corpus. Given a list of extracted URLs, a search engine needs to probe its corpus to determine whether each URL is already present [24]. This calls for efficient duplicate detection in which a small performance hit can be tolerated. A high FN rate (FNR) triggers recrawling of URLs and leads to severe performance degradation of the search engine, whereas a high FP rate (FPR) causes new URLs to be ignored and produces a stale corpus. Hence, both the FPR and the FNR need to be kept in balance.
Another interesting application of approximate duplicate detection in a streaming environment is the detection of fraudulent advertiser clicks [30]. In the web advertising domain, a publisher may, for the sake of profit, fake a certain number of clicks (using scripts), so there is a strong need to detect such malpractice. Detecting repeated user IDs or click-generating IP addresses in these cases can help minimize fraud. Solutions involving database access or probing of archives are, once again, too slow for such real-time click streams.