Database Reference
In-Depth Information
4.9 References for Chapter 4
Many ideas associated with stream management appear in the “chronicle data model” of
book on the subject of stream management.
The sampling technique of
Section 4.2
is from [
7
]. The Bloom Filter is generally attrib-
for calculating the surprise number and higher moments. However, the technique for main-
taining a uniformly chosen sample of positions in the stream is called “reservoir sampling”
[1] N. Alon, Y. Matias, and M. Szegedy, “The space complexity of approximating frequency moments,”
28th ACM
Symposium on Theory of Computing
, pp. 20-29, 1996.
[2] B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom, “Models and issues in data stream systems,”
Sym-
posium on Principles of Database Systems
, pp. 1-16, 2002.
[3] B.H. Bloom, “Space/time trade-offs in hash coding with allowable errors,”
Comm. ACM
13:7, pp. 422-426, 1970.
[4] M. Datar, A. Gionis, P. Indyk, and R. Motwani, “Maintaining stream statistics over sliding windows,”
SIAM J.
Computing
31, pp. 1794-1813, 2002.
[5] P. Flajolet and G.N. Martin, “Probabilistic counting for database applications,”
24th Symposium on Foundations
of Computer Science
, pp. 76-82, 1983.
[6] M. Garofalakis, J. Gehrke, and R. Rastogi (editors),
Data Stream Management
, Springer, 2009.
[7] P.B. Gibbons, “Distinct sampling for highly-accurate answers to distinct values queries and event reports,”
Intl.
Conf. on Very Large Databases
, pp. 541-550, 2001.
[8] H.V. Jagadish, I.S. Mumick, and A. Silberschatz, “View maintenance issues for the chronicle data model,”
Proc.
ACM Symp. on Principles of Database Systems
, pp. 113-124, 1995.
[9] W.H. Kautz and R.C. Singleton, “Nonadaptive binary superimposed codes,”
IEEE Transactions on Information
Theory
10, pp. 363-377, 1964.
[10] J. Vitter, “Random sampling with a reservoir,”
ACM Transactions on Mathematical Software
11:1, pp. 37-57,
1985.
1
While we shall refer to “users,” the search engine really receives IP addresses from which the search query was is-
sued. We shall assume that these IP addresses identify unique users, which is approximately true, but not exactly true.
2
At least that will be the case until IPv6 becomes the norm.
3
Technically, since the hash value is a bit-string of finite length, there is no contribution to 2
R
for
R
s that are larger than
the length of the hash value. However, this effect is not enough to avoid the conclusion that the expected value of 2
R
is much too large.
4
Technically, since
m
i
could be 0 for some elements in the universal set, we need to make explicit in the definition of
“moment” that 0
0
is taken to be 0. For moments 1 and above, the contribution of
m
i
that are 0 is surely 0.
5
Do not confuse these “buckets” with the “buckets” discussed in connection with hashing.