Approximating Streaming Data with Sketching - Real-Time Analytics

Database Reference

In-Depth Information

counters[j] = C;

} catch(IOException e) {

}

return true;

}

The calculation of set membership and estimates of the size of the filter

remain the same.

The downside of this approach is that it introduces the ability to produce

false negatives. To minimize the probability of a false negative in practice,

the authors of the original paper conducted a number of empirical

experiments using different values of d and C . The general rule of thumb

is to keep C as small as possible, because the larger C gets the larger d

needs to be. This introduces more steps in each update and slows the overall

algorithm. After a size for the counters has been established, the number of

elements to decrement is set via empirical testing of the sketch.

Distinct Value Sketches

The Bloom Filter is a fairly flexible algorithm for approximating set

membership and cardinality. To only approximate the cardinality of the set

it is possible to improve the Bloom Filter's error while decreasing memory

requirements using a specific Distinct Value (DV) Sketch. A variety of DV

sketch algorithms have been developed over the years. The various

approaches usually take on one of two different flavors, and this section

discusses one of each type.

The Min-Count Algorithm

The first type category, represented by the Min-Count algorithm (Frédéric

Giroire, “Order Statistics and Estimating Cardinalities of massive Data

Sets,” Discrete Applied Mathematics, 2009, Vol 157 Issue 2 (January),

406-427), relies on the distribution of order statistics. To understand how

this works, recall the earlier discussion of hash functions. A good hash

function takes an input value and maps it into a space of 2 p values such that

the distribution of values is uniform in [0,2 p -1].

Search WWH ::

Custom Search

Home