Database Reference
In-Depth Information
This can be further generalized to provide an expression for the cardinality
of the union of any number of sets: Add the cardinality of all the sets.
Subtract the cardinality of the intersection of each pair of sets. Add the
cardinalityoftheintersectiontoeachcombinationofthreesets.Subtractthe
cardinality of the intersections of each combination of four sets, and so on.
Although this allows for the approximation of arbitrarily large intersections
of sets, it has two drawbacks:
• The total number of operations required grows large very quickly.
• The Bloom Filter and Distinct Value sketch algorithms presented in this
chapter each have errors associated with them, and the errors
accumulate with each addition or subtraction. Even the intersection of
two sets will have an estimation error three times larger than the error
in the estimated size of any one set or any union of sets.
The Bloom Filter
Bloom filters are a data structure used by a variety of applications to store
set membership information. This data structure is compact and does not
depend on the number of items to be stored in the set. The tradeoff is that
the Bloom Filter may incorrectly report that an item is in the set when it
is not—a false positive. It will never report a false negative. It is this false
positive rate that is controlled by the size of the data structure.
The Algorithm
The Bloom Filter approximately represents a set using an array of m 1-bit
registers and k independent hash functions. In Java, a BitSet and k
MurmurHash functions, each seeded with independent random values, can
be used to implement the java.util.Set<E> interface:
public class BloomSet<E extends Serializable>
implements Set<E> {
BitSet bits;
Search WWH ::




Custom Search