Approximating Streaming Data with Sketching - Real-Time Analytics

Database Reference

In-Depth Information

combination using the inclusion-exclusion principle: |featureA:valueX|

+ |featureB:valueY| - |featureA:valueX union featureB:valueY|.

So long as the number of different values each feature can take on is

small, this can be done interactively. If the possible values each feature

can take on is very large, the best approach is to use the Count-Min

sketch to identify the most common feature values to control the

number presented in the pivot table user interface.

The Count-Min Sketch

The final sketch algorithm to discuss is the popular Count-Min sketch. This

algorithm essentially approximates a map (also called a multi-set) where

the distinct values act as the keys and the number of times an element

has been seen is the value. The simplest implementations of the Count-Min

sketchsupportapproximatelookupsinthismap,called point queries .When

the distinct values have some sort of intrinsic ordering, the sketch can be

considered an approximation of a histogram (also known as a frequency

table).

With a bit more work, the Count-Min sketch can also support range queries

that approximate the frequency of elements in the range [a,b]. After range

queries are supported, it is also possible to estimate order statistics using a

binary search on the ranges. This is discussed in detail later in this chapter.

Search WWH ::

Custom Search

Home