Database Reference
In-Depth Information
Chapter 10
Approximating Streaming Data with Sketching
Questions about a set of distinct elements often arise in analytics, streaming
or otherwise. Common questions are things like “Have I seen this before?”
(set membership), “How many different things have I seen?” (cardinality
estimation), or “How often do I see this thing?” (frequency).
When the number of different elements is small (low cardinality), this can
be computed directly even in a streaming system. Some of the storage
mechanisms introduced in earlier chapters, such as Redis, even have
specialized data structures to allow for maintaining sets and histograms with
real-time updates.
However, when the cardinality of the set becomes large (that is, there are
many distinct items) direct maintenance of these sets becomes problematic.
These data structures require O(log n) update time and O(n) storage, which
can be infeasible in a streaming setting and expensive (at the very least) in
terms of hardware costs.
To combat these problems, a number of algorithms, collectively known as
sketch algorithms, have been developed to approximate the answers to these
questions. Sketch algorithms have three features that make them desirable.
The first feature is constant time updates of the data, which allows them
to be easily maintained in a streaming setting. Secondly, the storage space
needed is usually independent of the amount of data. Finally, querying the
data structure can be completed in at worst linear time. The downside is that
to achieve these properties, errors are introduced into the reported results.
In this sense, they are like the sampling approaches discussed in Chapter 9,
“Statistical Approximation of Streaming Data.”
This chapter discusses four sketch algorithms, each with a different focus or
performance characteristics. The Bloom Filter is a general set representation
that can be used to answer questions about set membership and cardinality.
It is also the oldest of the algorithms, and its variations are used in many
areas. The two Distinct Value Sketches—Min-Count and HyperLogLog
algorithms—use different approaches to approximate the size of a set, also
known as the cardinality . Finally, the Count-Min sketch approximates a
Search WWH ::




Custom Search