Advanced Algorithms for Efficient Approximate Duplicate Detection in Data Streams Using Bloom Filters - Large Scale and Big Data: Processing and Management

Database Reference

In-Depth Information

13.7 CONCLUSIONS

Real-time de-duplication or data redundancy removal (DRR) for streaming data sets

poses a challenging problem. This chapter has presented novel Bloom filter based

algorithms to tackle the problem efficiently. The stable Bloom filter (SBF) uniquely

proposed the stability condition leading to constant FP and FN rates. However, the

stability is achieved at infinite stream length theoretically and SBF suffers from high

FNR. The constant guarantee on error rates makes SBF an extremely efficiently and

attractive data structure for DRR applications.

Using a novel combination of reservoir sampling and Bloom filters, the reservoir

sampling-based stable Bloom filter (RSBF) obtained enhanced FNR and faster con-

vergence to stability at comparable FPR with that of SBF. The stability property is

adhered to at each element of the stream. RSBF thus eliminates the problems faced

by SBF, and thus currently offers the best technique for de-duplication.

This chapter discussed at length the state-of-art structures for de-duplication

along with a broad theoretical outline. Detailed discussions on SBF and RSBF and

their working principles have been presented to the readers. Real-time in-memory

de-duplication experiments using both real and synthetically generated data sets

of up to 1 billion records have also been presented and compared. Despite such

advancements and enhanced performances, the effects of other biasing and sampling

functions to further decrease the FNR and FPR simultaneously along with other

improved data structures and parallel algorithms remain an open direction of future

work. It would also be interesting to study the effects of reservoir sampling on SBF.

REFERENCES

1. A. Adya, W.J. Bolosky, M. Castro, G. Cermak, R. Chaiken, J.R. Douceur, J. Howell,

R.J. Lorch, M. Theimer, and R. Wattenhofer. Farsite: Federated, available, and reliable

storage for an incompletely trusted environment. In OSDI , pages 1-14, 2002.

2. C. Aggarwal and P. Yu. Data Streams: Models and Algorithms. Springer, 2007.

3. C.C. Aggarwal. On biased reservoir sampling in the presence of stream evolution. In

VLDB , pages 607-618, 2006.

4. N. Alon, Y. Matias, and M. Szegedy. The space complexity of approximating the fre-

quency moments. In STOC , pages 20-29, 1996.

5. B. Babcock, M. Datar, and R. Motwani. Sampling from moving window over streaming

data. In SODA , pages 633-634, 2002.

6. F. Baboescu and G. Varghese. Scalable packet classification. In ACM SIGCOMM , pages

199-210, 2001.

7. M. Bilenko and R.J. Mooney. Adaptive duplicate detection using learnable string simi-

larity measures. In Proc. SIGKDD , pages 39-48, 2003.

8. B.H. Bloom. Space/time trade-offs in hash coding with allowable errors. Commun.

ACM , 13(7):422-426, 1970.

9. A.Z. Broder and M. Mitzenmacher. Network applications of bloom filters: A survey.

Internet Math. , 1(4):485-509, 2003.

10. A. Chowdhury, O. Frieder, D. Grossman, and M. McCabe. Collection statistics for fast

duplicate document detection. ACM Trans. Inform. Syst. , 20(2):171-191, 2002.

11. J. Conrad, X. Guo, and C. Schriber. Online duplicate document detection: Signature

reliability in a dynamic retrieval environment. In CIKM , pages 443-452, 2003.

Large Scale and Big Data: Processing and Management

Search WWH ::

Custom Search

Home