Database Reference
In-Depth Information
13.7 CONCLUSIONS
Real-time de-duplication or data redundancy removal (DRR) for streaming data sets
poses a challenging problem. This chapter has presented novel Bloom filter based
algorithms to tackle the problem efficiently. The stable Bloom filter (SBF) uniquely
proposed the stability condition leading to constant FP and FN rates. However, the
stability is achieved at infinite stream length theoretically and SBF suffers from high
FNR. The constant guarantee on error rates makes SBF an extremely efficiently and
attractive data structure for DRR applications.
Using a novel combination of reservoir sampling and Bloom filters, the reservoir
sampling-based stable Bloom filter (RSBF) obtained enhanced FNR and faster con-
vergence to stability at comparable FPR with that of SBF. The stability property is
adhered to at each element of the stream. RSBF thus eliminates the problems faced
by SBF, and thus currently offers the best technique for de-duplication.
This chapter discussed at length the state-of-art structures for de-duplication
along with a broad theoretical outline. Detailed discussions on SBF and RSBF and
their working principles have been presented to the readers. Real-time in-memory
de-duplication experiments using both real and synthetically generated data sets
of up to 1 billion records have also been presented and compared. Despite such
advancements and enhanced performances, the effects of other biasing and sampling
functions to further decrease the FNR and FPR simultaneously along with other
improved data structures and parallel algorithms remain an open direction of future
work. It would also be interesting to study the effects of reservoir sampling on SBF.
REFERENCES
1. A. Adya, W.J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J.R. Douceur, J. Howell,
R.J. Lorch, M. Theimer, and R. Wattenhofer. Farsite: Federated, available, and reliable
storage for an incompletely trusted environment. In OSDI , pages 1-14, 2002.
2. C. Aggarwal and P. Yu. Data Streams: Models and Algorithms. Springer, 2007.
3. C.C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In
VLDB , pages 607-618, 2006.
4. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the fre-
quency moments. In STOC , pages 20-29, 1996.
5. B. Babcock, M. Datar, and R. Motwani. Sampling from moving window over streaming
data. In SODA , pages 633-634, 2002.
6. F. Baboescu and G. Varghese. Scalable packet classification. In ACM SIGCOMM , pages
199-210, 2001.
7. M. Bilenko and R.J. Mooney. Adaptive duplicate detection using learnable string simi-
larity measures. In Proc. SIGKDD , pages 39-48, 2003.
8. B.H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun.
ACM , 13(7):422-426, 1970.
9. A.Z. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey.
Internet Math. , 1(4):485-509, 2003.
10. A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast
duplicate document detection. ACM Trans. Inform. Syst. , 20(2):171-191, 2002.
11. J. Conrad, X. Guo, and C. Schriber. Online duplicate document detection: Signature
reliability in a dynamic retrieval environment. In CIKM , pages 443-452, 2003.
Search WWH ::




Custom Search