Summarization [10] is one such approach: it identifies a group of itemsets such that each transaction is summarized by one itemset with as little loss of information as possible. Wang et al. [70] find summary sets, i.e., sets of itemsets such that each transaction is (partially) covered by the largest itemset that is frequent.
There are also model classes for which the link to compression exists but is hidden from plain sight. TILING [18] deserves mention here: a tiling is a cover of the database by the smallest set of itemsets, and is related to Set Cover [27], Minimum Entropy Set Cover [22], and matrix factorization problems [42, 45].
4 Algorithmic Approaches
So far we have discussed in detail the motivation, theoretical foundations, and models for compression-based pattern mining. Given the foregoing, the natural follow-up question is: given a dataset, how can we find the model that minimizes the total compressed size?
In this section we aim to give a brief overview of the main algorithmic strategies
for inducing good code tables from data. There are two main approaches we need to
discuss: candidate filtering and direct mining.
In our concise discussion on the complexity of the Minimum Description Length Code Table problem, we already mentioned that the search space will generally be too large to consider exhaustively. Hence, as is usual with MDL-based approaches, the solution is to resort to heuristic search strategies. This implies that we generally cannot guarantee to find the best possible model, and experimental evaluation has to reveal how useful the induced models are.
In this section, we will outline common techniques. For a more in-depth discussion
of the individual algorithms, we refer to the original papers; algorithmic aspects are
not the main focus of this chapter.
4.1 Candidate Set Filtering
The definition of Problem 3.1 already hints at the most often used approach: candidate filtering. While the candidate set F could consist of all possible patterns X, it can also be a subset defined by some additional constraints. Typically, F is generated in advance and given as an argument to the algorithm used for model induction.
For example, when inducing itemset-based models, it is common practice to use closed itemsets with a given minimum support threshold as the candidate set. A large advantage of using smaller candidate sets, i.e., keeping |F| small, is that model induction can be done relatively quickly.
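To make the notion of a candidate set concrete, the following is a minimal brute-force Python sketch of mining closed frequent itemsets from a toy transaction database. The dataset and support threshold are invented for illustration; real miners use far more efficient algorithms than this exhaustive enumeration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets meeting the minimum support (brute force)."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                freq[frozenset(cand)] = support
                found_any = True
        if not found_any:
            break  # anti-monotonicity: no larger itemset can be frequent
    return freq

def closed_itemsets(freq):
    """Keep only itemsets with no proper frequent superset of equal support."""
    return {X: s for X, s in freq.items()
            if not any(X < Y and s == t for Y, t in freq.items())}

# Toy database of four transactions (made up for illustration).
transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}]
candidates = closed_itemsets(frequent_itemsets(transactions, min_support=2))
```

Here {b} is frequent (support 3) but not closed, since its superset {a, b} has the same support; using closed itemsets thus shrinks F without losing support information.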
Given a dataset D and candidate set F, a candidate set filtering method returns a model M corresponding to a pattern set PS ⊆ F for which L(D, M) is 'small'. (Note that we cannot claim that the compressed size is minimal due to the heuristic nature of filtering methods.)
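The filtering loop itself can be sketched as follows. This is a toy illustration, not any specific algorithm from the literature: the two-part size function below is a deliberately simplified stand-in for the actual MDL encoded size L(D, M), counting one unit per pattern used in a cover plus the model's total pattern length. A candidate is kept only if adding it to the code table reduces this size.

```python
def cover(transaction, code_table):
    """Greedily cover a transaction with code-table itemsets, largest first."""
    remaining, used = set(transaction), []
    for X in sorted(code_table, key=len, reverse=True):
        if X <= remaining:
            used.append(X)
            remaining -= X
    return used  # complete, since all singletons are in the code table

def total_size(transactions, code_table):
    """Toy two-part size: model cost plus one unit per pattern used."""
    model = sum(len(X) for X in code_table)
    data = sum(len(cover(t, code_table)) for t in transactions)
    return model + data

def filter_candidates(transactions, candidates):
    """Keep a candidate only if it shrinks the toy compressed size."""
    items = {i for t in transactions for i in t}
    ct = [frozenset([i]) for i in items]  # singletons ensure full covers
    best = total_size(transactions, ct)
    for X in sorted(candidates, key=len, reverse=True):
        trial = ct + [frozenset(X)]
        size = total_size(transactions, trial)
        if size < best:
            ct, best = trial, size  # accept: compression improved
    return ct
```

On the toy database from above, the candidate {a, b, c} is accepted (it covers two transactions with a single code) while {a, b} is rejected, since its model cost outweighs the covers it shortens; this accept-if-it-compresses loop is the essence of candidate filtering.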