Database Reference
In-Depth Information
4.9 References for Chapter 4
Many ideas associated with stream management appear in the “chronicle data model” of
[ 8 ] . An early survey of research in stream-management systems is [ 2 ]. Also, [ 6 ] is a recent
book on the subject of stream management.
The sampling technique of Section 4.2 is from [ 7 ]. The Bloom Filter is generally attrib-
uted to [ 3 ] , although essentially the same technique appeared as “superimposed codes” in
[ 9 ] .
The algorithm for counting distinct elements is essentially that of [ 5 ] , although the par-
ticular method we described appears in [ 1 ] . The latter is also the source for the algorithm
for calculating the surprise number and higher moments. However, the technique for main-
taining a uniformly chosen sample of positions in the stream is called “reservoir sampling”
and comes from [ 10 ] .
The technique for approximately counting 1s in a window is from [ 4 ] .
[1] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approximating frequency moments,” 28th ACM
Symposium on Theory of Computing , pp. 20-29, 1996.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,” Sym-
posium on Principles of Database Systems , pp. 1-16, 2002.
[3] B.H. Bloom, “Space/time trade-offs in hash coding with allowable errors,” Comm. ACM 13:7, pp. 422-426, 1970.
[4] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining stream statistics over sliding windows,” SIAM J.
Computing 31, pp. 1794-1813, 2002.
[5] P. Flajolet and G.N. Martin, “Probabilistic counting for database applications,” 24th Symposium on Foundations
of Computer Science , pp. 76-82, 1983.
[6] M. Garofalakis, J. Gehrke, and R. Rastogi (editors), Data Stream Management , Springer, 2009.
[7] P.B. Gibbons, “Distinct sampling for highly-accurate answers to distinct values queries and event reports,” Intl.
Conf. on Very Large Databases , pp. 541-550, 2001.
[8] H.V. Jagadish, I.S. Mumick, and A. Silberschatz, “View maintenance issues for the chronicle data model,” Proc.
ACM Symp. on Principles of Database Systems , pp. 113-124, 1995.
[9] W.H. Kautz and R.C. Singleton, “Nonadaptive binary superimposed codes,” IEEE Transactions on Information
Theory 10, pp. 363-377, 1964.
[10] J. Vitter, “Random sampling with a reservoir,” ACM Transactions on Mathematical Software 11:1, pp. 37-57,
1985.
1 While we shall refer to “users,” the search engine really receives IP addresses from which the search query was is-
sued. We shall assume that these IP addresses identify unique users, which is approximately true, but not exactly true.
2 At least that will be the case until IPv6 becomes the norm.
3 Technically, since the hash value is a bit-string of finite length, there is no contribution to 2 R for R s that are larger than
the length of the hash value. However, this effect is not enough to avoid the conclusion that the expected value of 2 R
is much too large.
4 Technically, since m i could be 0 for some elements in the universal set, we need to make explicit in the definition of
“moment” that 0 0 is taken to be 0. For moments 1 and above, the contribution of m i that are 0 is surely 0.
5 Do not confuse these “buckets” with the “buckets” discussed in connection with hashing.
Search WWH ::




Custom Search