Other examples of compression-based pattern mining algorithms employing single-pass filtering include R-KRIMP [32], RDB-KRIMP [33], LESS [24], PACK [62], and SQS [64].
4.1.2 Iterative Candidate Selection
Single-pass filtering is a very greedy search strategy. One particular point of concern is that it considers every candidate only once, in fixed order, and decides acceptance or rejection based on the candidate's quality with respect to only the model mined up to that point. This means that unless the candidate order is perfect, candidates that would have been ideal later on are rejected, and sub-optimal candidates are accepted because the optimal candidate is not available at that time. The reason this strategy still provides good results is exactly the problem it aims to resolve: redundancy. For every rejected 'ideal' candidate we will (likely) encounter a good-enough variant later on.
The optimal result, however, may be a much smaller set of patterns that describes the data much better. One way to better approximate the optimal result is to make the search less greedy. Smets and Vreeken [59] showed that iteratively adding the locally optimal candidate leads to much better code tables:
1. Start with an 'empty' model $M$.
2. Select the $F \in \mathcal{F}$ that minimizes $L(D, M \oplus F)$.
3. Add $F$ to $M$ and remove it from $\mathcal{F}$.
4. Repeat steps 2-3 until compression can no longer be improved.
Naively, this entails iteratively re-ranking all candidates and taking the best one. That is, in terms of Chap. 5, this approach can be viewed as the dynamic ranking approach to pattern set mining.
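As an illustration, a minimal Python sketch of this iterative re-ranking loop is given below; the `encoded_length(data, model)` oracle standing in for $L(D, M)$ and the list-based model representation are assumptions, not part of any specific algorithm's interface.

```python
def iterative_candidate_selection(data, candidates, encoded_length):
    """Greedy iterative selection: in each iteration, re-rank all remaining
    candidates and add the one that improves compression most.
    `encoded_length(data, model)` is an assumed stand-in for L(D, M)."""
    model = []                                   # the 'empty' model M
    best_len = encoded_length(data, model)
    remaining = set(candidates)                  # the candidate set F
    while remaining:
        # Re-rank: evaluate every remaining candidate against the current model
        f = min(remaining, key=lambda c: encoded_length(data, model + [c]))
        new_len = encoded_length(data, model + [f])
        if new_len >= best_len:                  # no candidate improves compression
            break
        model.append(f)                          # accept F into M
        remaining.remove(f)                      # and remove it from F
        best_len = new_len
    return model
```

Note that every iteration evaluates all remaining candidates, which is what leads to the quadratic complexity discussed below.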
The naive implementation of this strategy is computationally much more demanding than single-pass filtering, with a complexity of $O(|\mathcal{F}|^2)$ as opposed to $O(|\mathcal{F}|)$. On the upside, it is less prone to local minima. If one desires to explore an even larger part of the search space, one could maintain the top-$k$ best models after each iteration instead of only the single best model. Such a strategy would essentially be a beam search, and is employed by the GROEI algorithm, proposed by Siebes and Kersten [56] to find good approximations to the problem of finding the best description of the data in $k$ patterns.
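To illustrate the beam-search idea, the sketch below keeps the `beam_width` best models per iteration; this is an illustration under assumed interfaces, not GROEI's actual implementation.

```python
def beam_search(data, candidates, encoded_length, k, beam_width=5):
    """Maintain the `beam_width` best-compressing models per iteration,
    growing each model by one pattern per iteration, up to k patterns.
    Patterns are assumed hashable (e.g., frozensets of items)."""
    beam = [frozenset()]                          # start from the empty model
    for _ in range(k):                            # models of at most k patterns
        # Expand every model in the beam with every unused candidate
        expansions = {m | {c} for m in beam for c in candidates if c not in m}
        if not expansions:
            break
        # Keep only the beam_width best-compressing expanded models
        beam = sorted(expansions, key=lambda m: encoded_length(data, m))[:beam_width]
    return min(beam, key=lambda m: encoded_length(data, m))
```

With `beam_width` set to 1 this reduces to the greedy loop above; larger widths trade extra computation for a broader exploration of the model space.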
Instead of exactly calculating the quality of each candidate per iteration, which requires a pass over the data and is hence expensive, we can also employ a quality estimate. To this end, the MTV algorithm uses a convex quality estimate [44], which allows it both to effectively prune a large part of the candidate space and to identify the best candidate without having to calculate the actual score. SLIM [59] uses an optimistic estimate, and only calculates the actual score for the top-$k$ candidates, until one is accepted by MDL.
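A rough sketch of this estimate-then-verify scheme is given below; `estimated_gain` is a hypothetical optimistic bound on the true compression gain, standing in for the algorithm-specific estimates used by MTV and SLIM.

```python
def select_with_estimate(data, model, candidates, encoded_length, estimated_gain):
    """Rank candidates by a cheap optimistic estimate of their compression
    gain, then compute the expensive exact score in that order, returning
    the first candidate that is actually accepted by MDL.
    `estimated_gain(model, c)` is an assumed upper bound on the true gain."""
    base_len = encoded_length(data, model)
    # Best estimated gain first; the estimate requires no pass over the data
    for c in sorted(candidates, key=lambda c: estimated_gain(model, c), reverse=True):
        if encoded_length(data, model + [c]) < base_len:  # exact check: one data pass
            return c                                      # first MDL-accepted candidate
    return None                                           # nothing improves compression
```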