2.3 Low Entropy Sets
The final absolute measure for interestingness we discuss is entropy. Many
of the above measures put explicit importance on the associations between 1s in the
data, ignoring or penalizing 0s. This, however, overlooks the fact that there may
be interesting associations in the data between both the 1s and the 0s. Heikinheimo
et al. [30] hence argue for putting equal importance on 0s and 1s, and instead of mining
frequent itemsets, propose to mine itemsets for which the counts of the contingency
table are highly skewed. That is, for an itemset X we calculate the support of all of
its 2^|X| instances, and calculate the entropy over these counts. The score is minimal
(0) when only one instance occurs in the data, e.g., if for itemset X = abc we find
supp(abc = 110) = |D|, while the score is maximal (|X|) when all instances have
the same support.
Intuition: An itemset X is interesting if the entropy of its instance counts is extreme,
i.e., the data over X is either highly structured (low entropy) or very random (high entropy).
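To make the score concrete, the following Python sketch computes the entropy of an itemset over a 0/1 dataset. The data representation and the name itemset_entropy are illustrative assumptions on our part, not notation from [30].

from collections import Counter
from math import log2

def itemset_entropy(data, itemset):
    # data: list of transactions, each a tuple of 0/1 values.
    # itemset: tuple of item (column) indices.
    # Count how often each of the 2^|X| instances of the itemset occurs.
    counts = Counter(tuple(row[i] for i in itemset) for row in data)
    n = len(data)
    # H(X) = -sum_z p(z) log2 p(z); instances that never occur contribute 0.
    return -sum((c / n) * log2(c / n) for c in counts.values())

# Toy data over four items: items 0-2 are constant, item 3 is fifty-fifty.
data = [(1, 1, 0, 1), (1, 1, 0, 0), (1, 1, 0, 1),
        (1, 1, 0, 0), (1, 1, 0, 1), (1, 1, 0, 0)]
print(itemset_entropy(data, (0, 1)))  # 0.0: only the instance '11' occurs
print(itemset_entropy(data, (3,)))    # 1.0: the maximal score, |X| = 1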
Using this score, which is monotone (the entropy of an itemset never decreases
as items are added), we can use the level-wise algorithm to efficiently mine either
low entropy sets, if one is interested in highly structured parts of the data, or high
entropy sets, if one is interested in identifying the most random parts of the data.
Mampaey [41] proposed to speed up the mining using inclusion-exclusion, exploiting
the fact that in practice only a fraction of all 2^|X| possible instances of X occur in
the data. The resulting μ-Miner algorithm is orders of magnitude faster than the
level-wise algorithm.
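To illustrate the monotonicity argument, here is a minimal level-wise (Apriori-style) sketch for mining low entropy sets, reusing itemset_entropy from the sketch above. Since entropy never decreases when an itemset is extended, the collection of itemsets with entropy at most a threshold sigma is downward closed, which is what licenses the subset-based pruning. This is a didactic sketch under those assumptions; it does not implement the inclusion-exclusion speed-up of μ-Miner [41].

def low_entropy_sets(data, n_items, sigma):
    # Level 1: low entropy singletons.
    level = {frozenset([i]) for i in range(n_items)
             if itemset_entropy(data, (i,)) <= sigma}
    result = set(level)
    while level:
        # Join: combine k-sets that differ in one item into (k+1)-candidates.
        candidates = {a | b for a in level for b in level
                      if len(a | b) == len(a) + 1}
        # Prune (monotonicity): every k-subset must already be low entropy;
        # only then is the candidate tested against the data.
        level = {c for c in candidates
                 if all(c - {i} in result for i in c)
                 and itemset_entropy(data, tuple(c)) <= sigma}
        result |= level
    return result

# On the toy data above, all seven nonempty subsets of {0, 1, 2} qualify
# at sigma = 0.5, while item 3 (entropy 1.0) is pruned at the first level.
print(sorted(map(sorted, low_entropy_sets(data, 4, 0.5))))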
3 Advanced Methods
Though each of the methods described above has nice properties, we find that in
practice they do not perform as well as advertised. In general, we find that all absolute
measures identify far too many results as interesting, with or without condensation.
The key problem is redundancy. Absolute measures have no means of identifying
whether the score for a pattern is expected, nor can they prune the many variants
of a pattern that all reflect a single statistically significant concept.
We identify three main lines of research aimed at tackling these problems, or in
other words, aimed at identifying more interesting patterns. A common theme in
these approaches is the reliance on statistical analysis. The main difference between
these methods and the methods described in the previous section is that in order to
rank patterns we impose a statistical model on our data, and measure how interesting
the patterns are given that model.
We can divide the methods into three rough categories:
1. Static pattern ranking. Here we assume that we know a simple statistical model,
derived from simple background information. We assume that this model is
well-understood, and any pattern that is well-explained by this model should be
considered uninteresting.