the two techniques (SBF and RSBF) over varied data sets, along with the parameter settings, are presented in Section 13.6. Finally, Section 13.7 concludes the chapter, summarizing the findings and citing possible directions for future work.
13.2 DUPLICATE DETECTION APPROACHES AND DATA STRUCTURES
Duplicate detection is a classical problem in data storage and databases, and it has given rise to numerous buffering solutions. With the advent of online arrival of data and transactions, detecting duplicates in such streaming environments using buffering and caching mechanisms [19] constitutes a naïve solution. The inability to store all the data arriving on the stream led to the design of approximate de-duplication methods.
The management of large data streams with limited memory and acceptable error rates, for tasks such as computing approximate frequency moments [4], element classification [23], and correlated aggregate queries [21], has become a spotlight among the research community. Bit shaving, the practice of fraudulent advertisers withholding commission for a certain fraction of the traffic or hits, has also been studied in this context [34]. This prompted the growth of approximate duplicate detection techniques in both databases and web applications. Redundancy removal algorithms for search engines were first studied in [10,11,28].
Formally, the duplicate detection problem is defined as follows: given a data stream S and a fixed memory space M, report whether an element e_i in S has already occurred among e_1, e_2, ..., e_{i-1}. As storing the entire stream, which can be unbounded, is infeasible, an approximate estimate minimizing the error rates is required.
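The exact version of this membership query is straightforward to state in code; the sketch below (illustrative, not from the chapter) makes the infeasibility concrete: the set of seen elements grows with the number of distinct elements, which is unbounded for an unbounded stream.

```python
# Exact duplicate detection: for each element e_i, answer whether it
# occurred among e_1, ..., e_{i-1}. The auxiliary set grows with the
# number of distinct elements, so memory is unbounded on an unbounded
# stream -- exactly the infeasibility that motivates approximate methods.
def detect_duplicates(stream):
    seen = set()
    for e in stream:
        yield e in seen  # True if e is a duplicate of an earlier element
        seen.add(e)

flags = list(detect_duplicates(["a", "b", "a", "c", "b"]))
# flags == [False, False, True, False, True]
```

Approximate schemes trade this unbounded state for a fixed memory M, accepting a small error rate in return.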
13.2.1 Database Query and Buffering Techniques
Straightforward approaches using traditional database queries or pairwise string matching algorithms are prohibitively slow, as they involve disk accesses, defeating the real-time demands of streaming applications. Naïve caching and buffering methods [19] populate a fixed-size buffer with elements arriving on the stream. The membership query for an element is then answered by checking whether the element is present in the buffer. However, once the buffer is completely filled, storing a new element requires evicting an element already in the buffer. Several such eviction or replacement policies have been studied in [19]. Nevertheless, the performance of these methods depends heavily on the replacement policy adopted.
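A minimal sketch of such a buffered detector, assuming a least-recently-used (LRU) eviction policy (one of many possible choices, not necessarily the one favored in [19]):

```python
from collections import OrderedDict

# Buffered duplicate detection with a fixed-capacity buffer and LRU
# eviction. Once an element is evicted it is forgotten, so a later
# re-arrival is misreported as new -- which is why accuracy hinges on
# the replacement policy.
class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.buf = OrderedDict()  # keys ordered from least to most recent

    def seen(self, e):
        """Answer the membership query for e, then record e's arrival."""
        if e in self.buf:
            self.buf.move_to_end(e)  # refresh recency on a hit
            return True
        if len(self.buf) >= self.capacity:
            self.buf.popitem(last=False)  # evict least recently used
        self.buf[e] = None
        return False
```

For example, with capacity 2, the arrivals x, y, x, z, y yield False, False, True, False, False: the final y is a false negative because it was evicted to make room for z.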
Fuzzy algorithms were also proposed in [7,39] to study the performance of approximate duplicate detection mechanisms. Bit shaving in the context of fraudulent advertiser traffic was studied in [34], and duplicate detection for web applications was presented in [10,11,28]. File-level hashing was used for de-duplication in storage systems [1,14,37], but it provided a low compression ratio. Secure hashes were even proposed for fixed-sized data blocks [33].
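The file-level idea can be sketched as follows (an illustrative example, not the scheme of [1,14,37]): each file is keyed by a secure hash of its entire content, so identical files are stored only once. Because the granularity is the whole file, two files differing by a single byte share no storage, which explains the low compression ratio noted above.

```python
import hashlib

# File-level hashing for de-duplication: a file (modeled as a byte
# string) is stored under the SHA-256 digest of its full content, so
# byte-identical files collapse to one stored copy. Any change, however
# small, produces a new digest and a whole new copy.
store = {}

def store_file(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    store.setdefault(digest, data)  # keep only one copy per digest
    return digest

k1 = store_file(b"hello world")
k2 = store_file(b"hello world")   # exact duplicate: nothing new stored
k3 = store_file(b"hello world!")  # one byte differs: a full new copy
```

Here k1 == k2 while k3 differs, and the store holds two copies despite the three files sharing almost all their content.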