Database Reference
In-Depth Information
To improve the error, one approach would be to use the same approach
employed by the Bloom Filter, introducing m different hash functions to
produce m independent streams of data. Taking the arithmetic mean of
these m independentstreamsyieldsthesameexpectationasasinglestream,
but with a standard deviation proportional to thereby improving the
standard error. Using m independent hash functions is computationally
expensive, so Min-Count uses a technique called stochastic averaging ,
which was introduced by Flajolet in 1985 to simulate independent streams.
To simulate independent random streams, the first b bits of the hash value
are used to define 2 b registers where b is typically smaller than half the
number of bits used in the hash function. If each register that stores the
three smallest is normalized to be a number in [0,1] the cardinality estimate
becomes:
with a standard error of . To use a concrete example, a 32-bit hash function
with 65,536 registers would require 384k of storage. By contrast, a sampling
approach using the same 32-bit hash to reduce the storage requirements
would be able to store about 98,000 values in the same amount of space,
which may not be sufficient if some of the distinct values are quite rare and
the data stream very large.
Implementing Min-Count
Like the Bloom Filter implementation, the Min-Count implementation uses
the standard Java Set interface and works on serializable elements. This
implementation also uses the third smallest value to avoid the issues
mentioned in the previous section. Using a parameter b to define the
number of bits to use and defining 2b registers, the class is initialized as
follows:
public class MinCount<E extends Serializable>
implements Set<E> {
int b;
int[] M;
SerializableHasher hash = new SerializableHasher();
ObjectOutputStream out;
Search WWH ::




Custom Search