Mining Data Streams - Mining of Massive Datasets

Database Reference

In-Depth Information

4.9 References for Chapter 4

Many ideas associated with stream management appear in the “chronicle data model” of

[ 8 ] . An early survey of research in stream-management systems is [ 2 ]. Also, [ 6 ] is a recent

book on the subject of stream management.

The sampling technique of Section 4.2 is from [ 7 ]. The Bloom Filter is generally attrib-

uted to [ 3 ] , although essentially the same technique appeared as “superimposed codes” in

[ 9 ] .

The algorithm for counting distinct elements is essentially that of [ 5 ] , although the par-

ticular method we described appears in [ 1 ] . The latter is also the source for the algorithm

for calculating the surprise number and higher moments. However, the technique for main-

taining a uniformly chosen sample of positions in the stream is called “reservoir sampling”

and comes from [ 10 ] .

The technique for approximately counting 1s in a window is from [ 4 ] .

[1] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approximating frequency moments,” 28th ACM

Symposium on Theory of Computing , pp. 20-29, 1996.

[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” Sym-

posium on Principles of Database Systems , pp. 1-16, 2002.

[3] B.H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Comm. ACM 13:7, pp. 422-426, 1970.

[4] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining stream statistics over sliding windows,” SIAM J.

Computing 31, pp. 1794-1813, 2002.

[5] P. Flajolet and G.N. Martin, “Probabilistic counting for database applications,” 24th Symposium on Foundations

of Computer Science , pp. 76-82, 1983.

[6] M. Garofalakis, J. Gehrke, and R. Rastogi (editors), Data Stream Management , Springer, 2009.

[7] P.B. Gibbons, “Distinct sampling for highly-accurate answers to distinct values queries and event reports,” Intl.

Conf. on Very Large Databases , pp. 541-550, 2001.

[8] H.V. Jagadish, I.S. Mumick, and A. Silberschatz, “View maintenance issues for the chronicle data model,” Proc.

ACM Symp. on Principles of Database Systems , pp. 113-124, 1995.

[9] W.H. Kautz and R.C. Singleton, “Nonadaptive binary superimposed codes,” IEEE Transactions on Information

Theory 10, pp. 363-377, 1964.

[10] J. Vitter, “Random sampling with a reservoir,” ACM Transactions on Mathematical Software 11:1, pp. 37-57,

1985.

1 While we shall refer to “users,” the search engine really receives IP addresses from which the search query was is-

sued. We shall assume that these IP addresses identify unique users, which is approximately true, but not exactly true.

2 At least that will be the case until IPv6 becomes the norm.

3 Technically, since the hash value is a bit-string of finite length, there is no contribution to 2 R for R s that are larger than

the length of the hash value. However, this effect is not enough to avoid the conclusion that the expected value of 2 R

is much too large.

4 Technically, since m i could be 0 for some elements in the universal set, we need to make explicit in the definition of

“moment” that 0 0 is taken to be 0. For moments 1 and above, the contribution of m i that are 0 is surely 0.

5 Do not confuse these “buckets” with the “buckets” discussed in connection with hashing.

Search WWH ::

Custom Search

Home