4.1.3 Pruning
Another improvement that can be used by any candidate filtering approach is to add
a pruning step: patterns that were added to the model before may become obsolete
later during the search. That is, due to other additions, previously added patterns
may no longer contribute to improved compression. To remedy this, we can prune
the model, i.e., we can test whether removing patterns from the model results in
improved compression.
Again, there are many possibilities. The most obvious strategy is to check the
attained compression of all valid subsets of the current pattern set and choose the
corresponding model with minimal compressed size. One could even include a new
candidate pattern in this process, yet this requires a considerable amount of extra
computation.
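To make this exhaustive strategy concrete, a minimal sketch could look as follows; it is not part of the original text, and `encoded_size` is a hypothetical stand-in for the total encoded length of data and model, with `model` the current set of patterns under consideration for removal.

```python
from itertools import combinations

def prune_exhaustive(model, data, encoded_size):
    """Sketch: evaluate every subset of the current patterns and return the
    one with the smallest total encoded size of data plus model."""
    best_patterns = list(model)
    best_size = encoded_size(data, best_patterns)
    for k in range(len(model)):
        for subset in combinations(model, k):
            size = encoded_size(data, list(subset))
            if size < best_size:
                best_patterns, best_size = list(subset), size
    return best_patterns
```

Since the number of subsets grows exponentially with the number of patterns, this is rarely practical beyond very small models.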
A more efficient alternative is to prune only directly after a candidate F is accepted.
To keep the pruning search space small, one could consider each pattern in the
current model for removal once after acceptance of another pattern, in a heuristic
order. If pruning a pattern does not result in an increased encoded size of data and
model, it apparently no longer contributes to compression. When this is the case, it
is permanently removed from the model. Even simple pruning techniques like this
can vastly improve the compression ratios attained by pattern-based models found
by candidate filtering methods.
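A sketch of this post-acceptance pruning could look as follows; it assumes a hypothetical code-table interface with `patterns`, `usage`, and `without`, and uses ascending usage as one possible heuristic order.

```python
def prune_after_acceptance(model, data, encoded_size):
    """Sketch: after a new candidate has been accepted into the model,
    consider each remaining pattern once for removal, in a heuristic order,
    and drop it permanently if removal does not increase the encoded size."""
    current_size = encoded_size(data, model)
    # One possible heuristic order: patterns with low usage first (assumption).
    for pattern in sorted(model.patterns, key=model.usage):
        pruned = model.without(pattern)        # copy of the model minus `pattern`
        pruned_size = encoded_size(data, pruned)
        if pruned_size <= current_size:        # no loss in compression: remove it
            model, current_size = pruned, pruned_size
    return model
```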
Pruning has been shown to be one of the key elements of KRIMP [68], as it allows
for removing patterns from the model for which we have found better replacements,
and which, if kept, stand in the way (in terms of cost) of other patterns. Pruning
practically always improves performance, both in terms of speed and compression
ratio, as well as in yielding smaller pattern sets [23, 64, 68].
4.2 Direct Mining of Patterns that Compress
Candidate filtering is conceptually easy and generally applicable. It allows us to mine
any set of candidate patterns, and then select a good subset. However, the reason
for mining code tables, the pattern explosion, is also the Achilles heel of this two-
stage approach. Mining, storing, and sorting candidate patterns is computationally
demanding for non-trivial data, particularly since lower thresholds correspond to better
models: larger candidate sets induce a larger model space, and hence allow for better
models to be discovered. However, the vast majority of these patterns will never be
selected or make it into the final model, which raises the question: can we not mine
a good code table directly from the data?
The space of models is too erratic to allow direct sampling of high-quality
code tables. We can, however, adapt the iterative candidate selection scheme above.
In particular, instead of iteratively identifying the best candidate from F, we use the
current model M to generate candidates that are likely good additions to the model.
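One way to generate such candidates, in the spirit of approaches like SLIM, is to join pairs of patterns already in the model and rank the resulting unions by their estimated gain in compression. The sketch below assumes hypothetical helpers `model.patterns` and `estimated_gain`, neither of which appears in the original text.

```python
from itertools import combinations

def generate_candidates(model, data, estimated_gain):
    """Sketch: form candidates by taking the union of pairs of patterns
    already in the model, and rank them by their estimated gain in
    compression if they were added to the model."""
    scored = []
    for x, y in combinations(model.patterns, 2):
        candidate = x | y                      # union of two current patterns
        gain = estimated_gain(data, model, candidate)
        if gain > 0:                           # keep only promising candidates
            scored.append((gain, candidate))
    scored.sort(key=lambda gc: gc[0], reverse=True)
    return [candidate for _, candidate in scored]
```

Each accepted candidate changes the model, so candidate generation and gain estimation are repeated in every iteration of the search.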