Other examples of compression-based pattern mining algorithms employing single-pass filtering include R-KRIMP [32], RDB-KRIMP [33], LESS [24], PACK [62], and SQS [64].
4.1.2 Iterative Candidate Selection
Single-pass filtering is a very greedy search strategy. One particular point of concern is that it considers every candidate only once, in fixed order, and decides acceptance or rejection based on the candidate's quality with respect to only the model mined up to that point. This means that unless the candidate order is perfect, candidates that would have been ideal later on are rejected, and sub-optimal candidates are accepted because the optimal candidate is not available at that time. The reason this strategy still provides good results is exactly the problem it aims to resolve: redundancy. For every rejected 'ideal' candidate we will (likely) encounter a good-enough variant later on.
The optimal result, however, may be a much smaller set of patterns that describes the data much better. One way to better approximate the optimal result is to make the search less greedy. Smets and Vreeken [59] showed that iteratively adding the locally optimal candidate leads to much better code tables:
1. Start with an 'empty' model $M$.
2. Select the $F \in \mathcal{F}$ that minimizes $L(D, M \oplus F)$.
3. Add $F$ to $M$ and remove it from $\mathcal{F}$.
4. Repeat steps 2-3 until compression can no longer be improved.
Naively, this entails iteratively re-ranking all candidates and taking the best one. That is, in terms of Chap. 5, this approach can be viewed as the dynamic ranking approach to pattern set mining.
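As an illustration, a minimal Python sketch of this iterative re-ranking loop is given below; the `encoded_length(data, model)` oracle standing in for $L(D, M)$ and the list-based model representation are assumptions, not part of any specific algorithm's interface.

```python
def iterative_candidate_selection(data, candidates, encoded_length):
    """Greedy iterative selection: in each iteration, re-rank all remaining
    candidates and add the one that improves compression most.
    `encoded_length(data, model)` is an assumed stand-in for L(D, M)."""
    model = []                                   # the 'empty' model M
    best_len = encoded_length(data, model)
    remaining = set(candidates)                  # the candidate set F
    while remaining:
        # Re-rank: evaluate every remaining candidate against the current model
        f = min(remaining, key=lambda c: encoded_length(data, model + [c]))
        new_len = encoded_length(data, model + [f])
        if new_len >= best_len:                  # no candidate improves compression
            break
        model.append(f)                          # accept F into M
        remaining.remove(f)                      # and remove it from F
        best_len = new_len
    return model
```

Note that every iteration evaluates all remaining candidates, which is what leads to the quadratic complexity discussed below.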
The naive implementation of this strategy is computationally much more demanding than single-pass filtering, with a complexity of $O(|\mathcal{F}|^2)$ as opposed to $O(|\mathcal{F}|)$. On the upside, it is less prone to local minima. If one desires to explore an even larger part of the search space, one could maintain the top-$k$ best models after each iteration instead of only the single best model. Such a strategy would essentially be a beam search, and is employed by the GROEI algorithm, proposed by Siebes and Kersten [56] to find good approximations to the problem of finding the best description of the data in $k$ patterns.
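To illustrate the beam-search idea, the sketch below keeps the `beam_width` best models per iteration; this is an illustration under assumed interfaces, not GROEI's actual implementation.

```python
def beam_search(data, candidates, encoded_length, k, beam_width=5):
    """Maintain the `beam_width` best-compressing models per iteration,
    growing each model by one pattern per iteration, up to k patterns.
    Patterns are assumed hashable (e.g., frozensets of items)."""
    beam = [frozenset()]                          # start from the empty model
    for _ in range(k):                            # models of at most k patterns
        # Expand every model in the beam with every unused candidate
        expansions = {m | {c} for m in beam for c in candidates if c not in m}
        if not expansions:
            break
        # Keep only the beam_width best-compressing expanded models
        beam = sorted(expansions, key=lambda m: encoded_length(data, m))[:beam_width]
    return min(beam, key=lambda m: encoded_length(data, m))
```

With `beam_width` set to 1 this reduces to the greedy loop above; larger widths trade extra computation for a broader exploration of the model space.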
Instead of exactly calculating the quality of each candidate per iteration, which requires a pass over the data and is hence expensive, we can also employ a quality estimate. To this end, the MTV algorithm uses a convex quality estimate [44], which allows it both to effectively prune a large part of the candidate space and to identify the best candidate without having to calculate the actual score. SLIM [59] uses an optimistic estimate, and only calculates the actual score for the top-$k$ candidates, until one is accepted by MDL.
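A rough sketch of this estimate-then-verify scheme is given below; `estimated_gain` is a hypothetical optimistic bound on the true compression gain, standing in for the algorithm-specific estimates used by MTV and SLIM.

```python
def select_with_estimate(data, model, candidates, encoded_length, estimated_gain):
    """Rank candidates by a cheap optimistic estimate of their compression
    gain, then compute the expensive exact score in that order, returning
    the first candidate that is actually accepted by MDL.
    `estimated_gain(model, c)` is an assumed upper bound on the true gain."""
    base_len = encoded_length(data, model)
    # Best estimated gain first; the estimate requires no pass over the data
    for c in sorted(candidates, key=lambda c: estimated_gain(model, c), reverse=True):
        if encoded_length(data, model + [c]) < base_len:  # exact check: one data pass
            return c                                      # first MDL-accepted candidate
    return None                                           # nothing improves compression
```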