13.1 INTRODUCTION AND MOTIVATION
Unprecedented technological advances in diverse fields have led to an explosion in the amount of data stored worldwide. With the volume of information breaking the petabyte barrier, efficient management, indexing, retrieval, and processing have become a central research focus in most data-intensive applications. Myriad sources such as telecommunication call data records, telescopic imagery, online transactions, web pages, stock markets, climate warning systems, and medical records demand resource- and compute-efficient handling of such rapidly growing data. Removing redundancy from such multibillion-record data sets constitutes an important area of study.
"Intelligent compression," or data de-duplication in streaming scenarios, calls for precise identification and elimination of duplicates on the fly from unbounded data streams so that downstream processing remains computationally efficient. The real-time nature of such applications, which require processing capacities greater than 1 GB/s, poses an even greater challenge in this domain. This chapter highlights the problem of detecting and eliminating redundant records arriving in large streaming data sets and provides readers with a detailed description of state-of-the-art solutions to this ever-growing problem.
Consider a national telecommunication network that generates call data records (CDRs) and stores important information such as the callee number, caller number, call duration, etc., for further analytics. Redundant or duplicate records may, however, be generated due to errors in this procedure. Storing billions of such CDRs in real time in a central data repository calls for duplicate detection and removal to enhance performance. Typical approaches that use database queries or Bloom filter [20] methods are either prohibitively slow due to disk access, defeating the real-time requirements of most applications, or extremely resource intensive, requiring around 20 GB to store the entire data set (~6 billion CDRs) for exact matching. Even disk-based algorithms [20] carry a heavy performance penalty. Hence, there is a paramount need for de-duplication algorithms that operate in memory, deliver real-time performance, and keep false-positive (FP) and false-negative (FN) rates at tolerable levels.
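To make the in-memory alternative concrete, the following is a minimal sketch, in Python, of a plain Bloom filter used as a streaming duplicate check. It illustrates the general technique rather than the specific method of [20]; the class name, the standard sizing formulas, and the example CDR keys are assumptions made for this example. A plain Bloom filter never misses a key it has already recorded (no false negatives) but may occasionally report a false positive, and it stores only a few bits per expected key rather than the full records.

import hashlib
import math


class BloomFilter:
    """Minimal in-memory Bloom filter for approximate duplicate detection.

    Sized for an expected number of keys n and a target false-positive
    rate p using the standard formulas m = -n ln p / (ln 2)^2 bits and
    k = (m / n) ln 2 hash functions.
    """

    def __init__(self, capacity, fp_rate=0.01):
        self.m = int(-capacity * math.log(fp_rate) / (math.log(2) ** 2))  # number of bits
        self.k = max(1, round((self.m / capacity) * math.log(2)))         # number of hashes
        self.bits = bytearray((self.m + 7) // 8)                          # zero-initialized bit array

    def _positions(self, item):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.m for i in range(self.k)]

    def seen_or_add(self, item):
        """Return True if item was (probably) seen before; otherwise record it."""
        positions = self._positions(item)
        seen = all(self.bits[p >> 3] & (1 << (p & 7)) for p in positions)
        if not seen:
            for p in positions:
                self.bits[p >> 3] |= 1 << (p & 7)
        return seen


# Usage: drop duplicates on the fly; the CDR key format below is hypothetical.
dedup = BloomFilter(capacity=1_000_000, fp_rate=0.001)
stream = [
    "9876543210|1234567890|2021-01-01T10:00:00|120",
    "9876543210|1234567890|2021-01-01T10:00:00|120",  # duplicate record
]
for cdr_key in stream:
    if dedup.seen_or_add(cdr_key):
        continue  # duplicate (or rare false positive): do not store this record
    # otherwise, forward the record to the central repository

With a capacity of one million keys and a 0.1% target FP rate, the filter above occupies roughly 1.8 MB, which illustrates why approximate in-memory structures are attractive compared with a ~20 GB exact-match store.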
The growth of search engines is another area where de-duplication algorithms are important. Search engines must regularly crawl the web to extract new URLs and update their corpus. Given a list of extracted URLs, a search engine needs to probe its corpus to determine whether each URL is already present [24]. This calls for efficient duplicate detection in which a small performance hit can be tolerated. A high FN rate (FNR) triggers recrawling of URLs and leads to severe performance degradation of the search engine, whereas a high FP rate (FPR) causes new URLs to be ignored and produces a stale corpus. Hence, both the FPR and the FNR need to be kept in balance.
Another interesting application of approximate duplicate detection in a streaming environment is the detection of fraudulent advertiser clicks [30]. In the web advertising domain, a publisher may, for the sake of profit, fake a certain number of clicks (using scripts), so there is a strong need to detect such malpractice. Detecting repeated user IDs or click-generating IP addresses in these cases can help minimize fraud. Solutions involving database access or probing of archives are, once again, too slow for such real-time click streams.