4.1.3 Pruning
Another improvement that can be used by any candidate filtering approach is to add
a pruning step: patterns that were added to the model before may become obsolete
later during the search. That is, due to other additions, previously added patterns
may no longer contribute to improved compression. To remedy this, we can prune
the model, i.e., we can test whether removing patterns from the model results in
improved compression.
Again, there are many possibilities. The most obvious strategy is to check the
attained compression of all valid subsets of the current pattern set and choose the
corresponding model with minimal compressed size. One could even include a new
candidate pattern in this process, yet this requires a considerable amount of extra
computation.
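To make this exhaustive strategy concrete, a minimal sketch could look as follows; it is not part of the original text, and `encoded_size` is a hypothetical stand-in for the total encoded length of data and model, with `model` the current set of patterns under consideration for removal.

```python
from itertools import combinations

def prune_exhaustive(model, data, encoded_size):
    """Sketch: evaluate every subset of the current patterns and return the
    one with the smallest total encoded size of data plus model."""
    best_patterns = list(model)
    best_size = encoded_size(data, best_patterns)
    for k in range(len(model)):
        for subset in combinations(model, k):
            size = encoded_size(data, list(subset))
            if size < best_size:
                best_patterns, best_size = list(subset), size
    return best_patterns
```

Since the number of subsets grows exponentially with the number of patterns, this is rarely practical beyond very small models.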
A more efficient alternative is to prune only directly after a candidate F is accepted.
To keep the pruning search space small, one could consider each pattern in the
current model for removal once after acceptance of another pattern, in a heuristic
order. If pruning a pattern does not result in an increased encoded size of data and
model, it apparently no longer contributes to compression. When this is the case, it
is permanently removed from the model. Even simple pruning techniques like this
can vastly improve the compression ratios attained by pattern-based models found
by candidate filtering methods.
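A sketch of this post-acceptance pruning could look as follows; it assumes a hypothetical code-table interface with `patterns`, `usage`, and `without`, and uses ascending usage as one possible heuristic order.

```python
def prune_after_acceptance(model, data, encoded_size):
    """Sketch: after a new candidate has been accepted into the model,
    consider each remaining pattern once for removal, in a heuristic order,
    and drop it permanently if removal does not increase the encoded size."""
    current_size = encoded_size(data, model)
    # One possible heuristic order: patterns with low usage first (assumption).
    for pattern in sorted(model.patterns, key=model.usage):
        pruned = model.without(pattern)        # copy of the model minus `pattern`
        pruned_size = encoded_size(data, pruned)
        if pruned_size <= current_size:        # no loss in compression: remove it
            model, current_size = pruned, pruned_size
    return model
```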
Pruning has been shown to be one of the key elements of KRIMP [68], as it allows
for removing patterns from the model for which we have found better replacements,
and which, if kept, stand in the way (in terms of cost) of other patterns. Pruning
practically always improves performance, both in terms of speed and compression
ratio, as well as in yielding smaller pattern sets [23, 64, 68].
4.2 Direct Mining of Patterns that Compress
Candidate filtering is conceptually easy and generally applicable. It allows us to mine
any set of candidate patterns, and then select a good subset. However, the reason
for mining code tables, the pattern explosion, is also the Achilles heel of this two-
stage approach. Mining, storing, and sorting candidate patterns is computationally
demanding for non-trivial data, particularly since lower thresholds correspond to better
models: larger candidate sets induce a larger model space, and hence allow for better
models to be discovered. However, the vast majority of these patterns will never be
selected or make it into the final model, which raises the question: can we not mine
a good code table directly from the data?
The space of models is too erratic to allow direct sampling of high-quality
code tables. We can, however, adapt the iterative candidate selection scheme above.
In particular, instead of iteratively identifying the best candidate from F, we use the
current model M to generate candidates that are likely good additions to the model.
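One way to generate such candidates, in the spirit of approaches like SLIM, is to join pairs of patterns already in the model and rank the resulting unions by their estimated gain in compression. The sketch below assumes hypothetical helpers `model.patterns` and `estimated_gain`, neither of which appears in the original text.

```python
from itertools import combinations

def generate_candidates(model, data, estimated_gain):
    """Sketch: form candidates by taking the union of pairs of patterns
    already in the model, and rank them by their estimated gain in
    compression if they were added to the model."""
    scored = []
    for x, y in combinations(model.patterns, 2):
        candidate = x | y                      # union of two current patterns
        gain = estimated_gain(data, model, candidate)
        if gain > 0:                           # keep only promising candidates
            scored.append((gain, candidate))
    scored.sort(key=lambda gc: gc[0], reverse=True)
    return [candidate for _, candidate in scored]
```

Each accepted candidate changes the model, so candidate generation and gain estimation are repeated in every iteration of the search.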