Database Reference
In-Depth Information
from a 1/20th sample of the data. For each of the queries below, indicate how you would
construct the sample. That is, tell what the key attributes should be.
(a) For each university, estimate the average number of students in a course.
(b) Estimate the fraction of students who have a GPA of 3.5 or more.
(c) Estimate the fraction of courses where at least half the students got “A.”
4.3 Filtering Streams
Another common process on streams is selection, or filtering. We want to accept those
tuples in the stream that meet a criterion. Accepted tuples are passed to another process as
a stream, while other tuples are dropped. If the selection criterion is a property of the tuple
that can be calculated (e.g., the first component is less than 10), then the selection is easy to
do. The problem becomes harder when the criterion involves lookup for membership in a
set. It is especially hard, when that set is too large to store in main memory. In this section,
we shall discuss the technique known as “Bloom filtering” as a way to eliminate most of
the tuples that do not meet the criterion.
4.3.1
A Motivating Example
Again let us start with a running example that illustrates the problem and what we can do
about it. Suppose we have a set S of one billion allowed email addresses - those that we
will allow through because we believe them not to be spam. The stream consists of pairs:
an email address and the email itself. Since the typical email address is 20 bytes or more,
it is not reasonable to store S in main memory. Thus, we can either use disk accesses to de-
termine whether or not to let through any given stream element, or we can devise a method
that requires no more main memory than we have available, and yet will filter most of the
undesired stream elements.
Suppose for argument's sake that we have one gigabyte of available main memory. In
the technique known as Bloom filtering , we use that main memory as a bit array. In this
case, we have room for eight billion bits, since one byte equals eight bits. Devise a hash
function h from email addresses to eight billion buckets. Hash each member of S to a bit,
and set that bit to 1. All other bits of the array remain 0.
Since there are one billion members of S , approximately 1/8th of the bits will be 1. The
exact fraction of bits set to 1 will be slightly less than 1/8th, because it is possible that two
members of S hash to the same bit. We shall discuss the exact fraction of 1s in Section
4.3.3 . When a stream element arrives, we hash its email address. If the bit to which that
Search WWH ::




Custom Search