is a popular choice. This means that, in practice, pattern set mining methods and
dynamic modeling have a lot in common; although in iterative ranking no explicit
global goal is defined, in order to find a ranking we iteratively and greedily select
the locally most informative pattern and update the model. A key difference to pattern
set mining is that there we explicitly keep the complexity of the pattern set in check;
we consider both the gain in quality and the cost in complexity over the full set.
(Though pattern set mining is a combinatorial problem, and intuitively optimizing
most of its instances seems very hard, so far theoretical hardness results have only
been established for a handful of cases, including [49, 48, 75].)
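To make this trade-off concrete, the following is a minimal sketch, in Python, of such a greedy selection loop. The quality and complexity functions are hypothetical placeholders for whatever data-fit score and model cost a concrete method defines; this illustrates the general idea only, not any specific algorithm from the literature.

    def greedy_select(candidates, quality, complexity, max_patterns=None):
        """Greedily build a pattern set, trading quality gain against complexity cost.

        quality(S):    fit of the data under the model induced by pattern set S
        complexity(S): cost (e.g. in bits) of describing the pattern set S itself
        Candidate patterns are assumed hashable, e.g. tuples of items.
        """
        selected = []
        current = quality(selected) - complexity(selected)
        while candidates and (max_patterns is None or len(selected) < max_patterns):
            # net gain of adding each remaining candidate to the current set
            gains = {p: quality(selected + [p]) - complexity(selected + [p]) - current
                     for p in candidates}
            best = max(gains, key=gains.get)
            if gains[best] <= 0:      # no candidate improves the overall score
                break
            selected.append(best)
            candidates = [p for p in candidates if p != best]
            current += gains[best]
        return selected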
6.1 Itemsets
KRIMP is among the most well-known pattern set mining algorithms. Siebes et al. [62]
define the best set of itemsets, by the Minimum Description Length principle, as the
set that provides the best lossless compression of the data. Each itemset is assigned
a code word, the length of which depends on how frequently the itemset is used when
greedily covering the data without overlap. The pattern set is then scored by the
number of bits necessary to losslessly describe the data, that is, the sum of the
number of bits needed to encode the dictionary (the itemsets and their code words)
and the number of bits needed to encode the data using these code words.
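As an illustration of such a two-part score, the sketch below computes code lengths from the usage counts of a given non-overlapping cover and sums the model and data costs. The per-item dictionary cost is a crude placeholder, not the exact encoding used by KRIMP [62].

    import math
    from collections import Counter

    def two_part_mdl_score(cover, bits_per_item=8.0):
        """Illustrative two-part MDL score for a non-overlapping cover.

        `cover` is a list of transactions, each given as the list of itemsets
        (tuples of items) used to cover it. Code lengths follow the usual
        -log2(P) rule, with P the relative usage of an itemset in the cover;
        the dictionary cost charged per itemset is a simple placeholder.
        """
        usage = Counter(itemset for transaction in cover for itemset in transaction)
        total = sum(usage.values())
        code_len = {x: -math.log2(usage[x] / total) for x in usage}

        # L(D | M): bits needed to encode the data with the code words
        data_bits = sum(usage[x] * code_len[x] for x in usage)
        # L(M): bits needed to encode the dictionary (itemsets plus code words)
        model_bits = sum(len(x) * bits_per_item + code_len[x] for x in usage)
        return model_bits + data_bits

    # toy cover: three transactions described by two itemsets
    print(two_part_mdl_score([[("a", "b")], [("a", "b"), ("c",)], [("c",)]]))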
The KRIMP algorithm heuristically finds good sets by first mining frequent itemsets
for a given minsup threshold, and then greedily selecting from these in a fixed order.
The resulting pattern sets are typically small (in the order of hundreds of itemsets)
and have been shown to be useful in many data mining tasks [70] (see also Chap. 8).
A number of variants of KRIMP exist for other data types, including one for
low-entropy sets [31]. Siebes and Kersten [61] investigated structure functions, and
proposed the GROEI algorithm for approximating the optimal k-pattern set.
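A rough sketch of that selection loop follows; it glosses over KRIMP's actual candidate order, cover strategy, and pruning, and the compressed_size function is an assumed helper returning the total encoded size, in bits, of a code table together with the data covered by it.

    def krimp_style_select(candidates, compressed_size):
        """Greedy single-pass selection in the spirit of KRIMP (a sketch only).

        `candidates` is assumed to be pre-sorted in a fixed order (KRIMP uses
        descending support, then descending length). `compressed_size(table)`
        is an assumed helper returning the total number of bits needed to
        encode the code table and the data covered with it.
        """
        code_table = []                    # KRIMP proper starts from all singletons
        best = compressed_size(code_table)
        for candidate in candidates:       # single pass, fixed order
            size = compressed_size(code_table + [candidate])
            if size < best:                # keep the candidate only if it compresses better
                code_table = code_table + [candidate]
                best = size
            # (KRIMP also prunes itemsets whose usage drops; omitted here)
        return code_table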
As an alternative to a descriptive model, we can aim to find a good generative model.
The MINI algorithm proposed by Gallo et al. [20] employs a probability distribution
based on itemsets and their frequencies, and aims to find the set of itemsets that
predicts the data best.
While intuitively appealing, using likelihood alone to score pattern sets is not
enough, since the score only increases as more patterns are included; the set
containing all itemsets will have the highest likelihood. To control the size of the
set we need to exert some control over the output. For example, we can either ask
the user to specify a number of patterns k, or automatically control the number of
itemsets using BIC [59] or MDL [58], in which case the improvement obtained by adding
a new itemset to the result set must be significant.
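As an illustration of BIC-based control, the sketch below accepts a candidate itemset only if the penalized score improves. The log-likelihood values are assumed to come from whatever generative model is in use, and counting one free parameter per itemset is a simplification.

    import math

    def bic(log_likelihood, num_itemsets, num_transactions):
        """Bayesian Information Criterion (lower is better): the data-fit term is
        penalized by a cost that grows with the number of itemsets, here crudely
        counted as one free parameter each."""
        return -2.0 * log_likelihood + num_itemsets * math.log(num_transactions)

    def accept_candidate(ll_without, ll_with, k, n):
        """Add the candidate itemset only if it improves the BIC score, i.e. only
        if the gain in log-likelihood outweighs the log(n)/2 penalty per itemset."""
        return bic(ll_with, k + 1, n) < bic(ll_without, k, n)

    # toy numbers: a gain of 10 in log-likelihood easily beats the penalty for n = 500
    print(accept_candidate(ll_without=-1200.0, ll_with=-1190.0, k=10, n=500))  # True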
The MaxEnt model employed by the MTV algorithm, which we met in the previous
section, does not only lend itself to iterative ranking, but can also be straight-
forwardly used with either of these two model selection criteria [42]. In practice,