2.3 Low Entropy Sets
The final absolute measure for interestingness we discuss is entropy. Many
of the above measures put explicit importance on the associations between 1s in the
data, ignoring or penalizing 0s. This, however, overlooks the fact that there may
be interesting associations in the data between both the 1s and the 0s. Heikinheimo
et al. [30] hence argue for putting equal importance on 0s and 1s, and instead of mining
frequent itemsets, propose to mine itemsets for which the counts of the contingency
table are highly skewed. That is, for an itemset X we calculate the support of all of
its 2^|X| instances, and calculate the entropy over these counts. The score is minimal
(0) when only one instance occurs in the data, e.g., if for itemset X = abc we find
supp(abc = 110) = |D|, while the score is maximal (|X|) when all instances have
the same support.
Intuition: An itemset X is interesting if the entropy of its instance counts is extreme,
i.e., the data over X is either highly structured (low entropy) or very random (high entropy).
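To make the score concrete, the following Python sketch computes the entropy of an itemset over a 0/1 dataset. The data representation and the name itemset_entropy are illustrative assumptions on our part, not notation from [30].

from collections import Counter
from math import log2

def itemset_entropy(data, itemset):
    # data: list of transactions, each a tuple of 0/1 values.
    # itemset: tuple of item (column) indices.
    # Count how often each of the 2^|X| instances of the itemset occurs.
    counts = Counter(tuple(row[i] for i in itemset) for row in data)
    n = len(data)
    # H(X) = -sum_z p(z) log2 p(z); instances that never occur contribute 0.
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy data over four items: items 0-2 are constant, item 3 is fifty-fifty.
data = [(1, 1, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1),
        (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 0, 0)]
print(itemset_entropy(data, (0, 1)))  # 0.0: only the instance '11' occurs
print(itemset_entropy(data, (3,)))    # 1.0: the maximal score, |X| = 1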
Using this score, which is monotone (the entropy of an itemset never decreases
as items are added), we can use the level-wise algorithm to efficiently mine either
low entropy sets, if one is interested in highly structured parts of the data, or high
entropy sets, if one is interested in identifying the most random parts of the data.
Mampaey [41] proposed to speed up the mining using inclusion-exclusion, exploiting
the fact that in practice only a fraction of all 2^|X| possible instances of X occur in
the data. The resulting μ-Miner algorithm is orders of magnitude faster than the
level-wise algorithm.
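To illustrate the monotonicity argument, here is a minimal level-wise (Apriori-style) sketch for mining low entropy sets, reusing itemset_entropy from the sketch above. Since entropy never decreases when an itemset is extended, the collection of itemsets with entropy at most a threshold sigma is downward closed, which is what licenses the subset-based pruning. This is a didactic sketch under those assumptions; it does not implement the inclusion-exclusion speed-up of μ-Miner [41].

def low_entropy_sets(data, n_items, sigma):
    # Level 1: low entropy singletons.
    level = {frozenset([i]) for i in range(n_items)
             if itemset_entropy(data, (i,)) <= sigma}
    result = set(level)
    while level:
        # Join: combine k-sets that differ in one item into (k+1)-candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune (monotonicity): every k-subset must already be low entropy;
        # only then is the candidate tested against the data.
        level = {c for c in candidates
                 if all(c - {i} in result for i in c)
                 and itemset_entropy(data, tuple(c)) <= sigma}
        result |= level
    return result

# On the toy data above, all seven nonempty subsets of {0, 1, 2} qualify
# at sigma = 0.5, while item 3 (entropy 1.0) is pruned at the first level.
print(sorted(map(sorted, low_entropy_sets(data, 4, 0.5))))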
3 Advanced Methods
Though each of the methods described above has nice properties, we find that in
practice they do not perform as well as advertised. In general, we find that all absolute
measures identify far too many results as interesting, with or without condensation.
The key problem is redundancy. Absolute measures have no means of identifying
whether the score for a pattern is expected, nor can they prune the many variants
of a pattern that all reflect a single statistically significant concept.
We identify three main lines of research aimed at tackling these problems, or in
other words, aimed at identifying more interesting patterns. A common theme in
these approaches is the reliance on statistical analysis. The main difference between
these methods and the methods described in the previous section is that in order to
rank patterns we impose a statistical model on our data, and measure how interesting
the patterns are given that model.
We can divide the methods into three rough categories:
1. Static pattern ranking. Here we assume that we know a simple statistical model,
derived from simple background information. We assume that this model is
well-understood, and any pattern that is well-explained by this model should be
considered uninteresting.