Summarization [10] is one such approach: it identifies a group of itemsets such that each transaction is summarized by one itemset with as little loss of information as possible. Wang et al. [70] find summary sets, i.e., sets of itemsets such that each transaction is (partially) covered by the largest itemset that is frequent.
There are also model classes for which the link to compression exists but is hidden from plain sight. TILING [18] deserves mention here: a tiling is a cover of the database by the smallest set of itemsets, and is related to Set Cover [27], Minimum Entropy Set Cover [22], and matrix factorization problems [42, 45].
4 Algorithmic Approaches
So far we have discussed in detail the motivation, theoretical foundations, and models for compression-based pattern mining. Given the foregoing, the natural follow-up question is: given a dataset, how can we find the model that minimizes the total compressed size?
In this section we aim to give a brief overview of the main algorithmic strategies
for inducing good code tables from data. There are two main approaches we need to
discuss: candidate filtering and direct mining.
In our concise discussion on the complexity of the Minimum Description Length Code Table problem, we already mentioned that the search space will generally be too large to consider exhaustively. Hence, as is usual with MDL-based approaches, the solution is to resort to heuristic search strategies. This implies that we generally cannot guarantee to find the best possible model, and experimental evaluation has to reveal how useful the induced models are.
In this section, we will outline common techniques. For a more in-depth discussion
of the individual algorithms, we refer to the original papers; algorithmic aspects are
not the main focus of this chapter.
4.1 Candidate Set Filtering
The definition of Problem 3.1 already hints at the most often used approach: candidate filtering. While the candidate set F could consist of all possible patterns X, it can also be a subset defined by some additional constraints. Typically, F is generated in advance and given as an argument to the algorithm used for model induction.
For example, when inducing itemset-based models, it is common practice to use closed itemsets with a given minimum support threshold as the candidate set. A large advantage of using smaller candidate sets, i.e., keeping |F| small, is that model induction can be done relatively quickly.
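To make the notion of a candidate set concrete, the following is a minimal brute-force Python sketch of mining closed frequent itemsets from a toy transaction database. The dataset and support threshold are invented for illustration; real miners use far more efficient algorithms than this exhaustive enumeration.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_support):
    """Enumerate all itemsets meeting the minimum support (brute force)."""
    items = sorted({i for t in transactions for i in t})
    freq = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for cand in combinations(items, size):
            support = sum(1 for t in transactions if set(cand) <= t)
            if support >= min_support:
                freq[frozenset(cand)] = support
                found_any = True
        if not found_any:
            break  # anti-monotonicity: no larger itemset can be frequent
    return freq

def closed_itemsets(freq):
    """Keep only itemsets with no proper frequent superset of equal support."""
    return {X: s for X, s in freq.items()
            if not any(X < Y and s == t for Y, t in freq.items())}

# Toy database of four transactions (made up for illustration).
transactions = [{'a', 'b', 'c'}, {'a', 'b'}, {'a', 'c'}, {'a', 'b', 'c'}]
candidates = closed_itemsets(frequent_itemsets(transactions, min_support=2))
```

Here {b} is frequent (support 3) but not closed, since its superset {a, b} has the same support; using closed itemsets thus shrinks F without losing support information.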
Given a dataset D and candidate set F, a candidate set filtering method returns a model M corresponding to a pattern set PS ⊆ F for which L(D, M) is 'small'. (Note that we cannot claim that the compressed size is minimal due to the heuristic nature of filtering methods.)
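The filtering loop itself can be sketched as follows. This is a toy illustration, not any specific algorithm from the literature: the two-part size function below is a deliberately simplified stand-in for the actual MDL encoded size L(D, M), counting one unit per pattern used in a cover plus the model's total pattern length. A candidate is kept only if adding it to the code table reduces this size.

```python
def cover(transaction, code_table):
    """Greedily cover a transaction with code-table itemsets, largest first."""
    remaining, used = set(transaction), []
    for X in sorted(code_table, key=len, reverse=True):
        if X <= remaining:
            used.append(X)
            remaining -= X
    return used  # complete, since all singletons are in the code table

def total_size(transactions, code_table):
    """Toy two-part size: model cost plus one unit per pattern used."""
    model = sum(len(X) for X in code_table)
    data = sum(len(cover(t, code_table)) for t in transactions)
    return model + data

def filter_candidates(transactions, candidates):
    """Keep a candidate only if it shrinks the toy compressed size."""
    items = {i for t in transactions for i in t}
    ct = [frozenset([i]) for i in items]  # singletons ensure full covers
    best = total_size(transactions, ct)
    for X in sorted(candidates, key=len, reverse=True):
        trial = ct + [frozenset(X)]
        size = total_size(transactions, trial)
        if size < best:
            ct, best = trial, size  # accept: compression improved
    return ct
```

On the toy database from above, the candidate {a, b, c} is accepted (it covers two transactions with a single code) while {a, b} is rejected, since its model cost outweighs the covers it shortens; this accept-if-it-compresses loop is the essence of candidate filtering.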