cover function presented in Algorithm 4. As we will see in the next section, there
also exist more sophisticated algorithms for inducing code tables.
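As a rough illustration of such a cover function, the greedy strategy can be sketched as follows. This is a minimal sketch, not KRIMP's actual implementation; the representation of a code table as an ordered list of frozensets, and the fallback to singleton codes, are assumptions for illustration:

```python
def cover(transaction, code_table):
    """Greedily cover a transaction with code table elements.

    `code_table` is assumed to be a list of frozensets, already
    sorted in cover order (longer, more frequent itemsets first).
    Returns the list of itemsets used; items not covered by any
    non-singleton element fall back to their singleton codes.
    Illustrative sketch only, not the published algorithm.
    """
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:  # itemset fits in the uncovered part
            used.append(itemset)
            remaining -= itemset
    # uncovered items are encoded with their singleton codes
    used.extend(frozenset([item]) for item in remaining)
    return used
```

For example, covering the transaction {1, 2, 3, 4} with a code table containing {1, 2, 3} and {2, 4} uses {1, 2, 3} first, after which {2, 4} no longer fits and item 4 falls back to its singleton code.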
In practice, we find that KRIMP returns pattern sets in the order of hundreds to a few thousand patterns [68], which have been shown to describe the distribution of the data very well. In the next section we will discuss some of the applications in which these pattern sets have been successfully put to use.
Akoglu et al. [3] proposed the COMPREX algorithm, which describes a categorical dataset by a set of KRIMP code tables: it partitions the attributes into parts that correlate strongly, and induces a KRIMP code table for each part directly from the data.
In frequent pattern mining, and hence in KRIMP, only associations between 1s of the data are regarded as potentially interesting. This is mostly a matter of keeping the computation feasible; clearly there are cases where associations between occurrences and absences are also interesting. LESS [24] is an algorithm that describes data not with frequent itemsets but with low-entropy sets [23]: attribute sets for which the distribution of the occurrences is strongly skewed. LESS code tables consist of low-entropy sets, and the algorithm uses these to identify areas of the data where the attributes strongly interact. LESS code tables typically contain only tens to hundreds of low-entropy sets. Attribute clustering [43] provides even more succinct code tables: with the goal of providing good high-level summaries of categorical data, only up to tens of patterns are selected.
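To make "strongly skewed" concrete: the entropy of an attribute set is the entropy of the empirical distribution over its value combinations, and low-entropy sets are those scoring well below the maximum. A minimal sketch, assuming data is given as a list of equal-length 0/1 tuples:

```python
import math
from collections import Counter

def set_entropy(rows, attrs):
    """Empirical entropy, in bits, of the joint distribution over
    the value combinations of the attributes in `attrs`.

    `rows` is a list of equal-length 0/1 tuples (an assumed
    representation). A low value means the distribution of
    occurrences is strongly skewed, i.e., a low-entropy set.
    """
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    # Shannon entropy of the empirical value-combination distribution
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For instance, for the rows (1,1), (1,1), (1,1), (0,0) the attribute pair {0, 1} has entropy of about 0.81 bits, far below the 2-bit maximum for two binary attributes, making it a candidate low-entropy set.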
Code table instances for richer data include those for sequential patterns, i.e., serial episodes. Bathoorn et al. [5] gave a variant of KRIMP for sequential patterns without
gaps, whereas the SQS [64] and GOKRIMP [34] algorithms provide fast approaches for descriptions in terms of serial episodes where gaps are allowed. Like KRIMP, these
algorithms find final selections in the order of hundreds of patterns.
Koopman and Siebes [32, 33] discussed the KRIMP framework in light of frequent patterns over multi-relational databases.
3.3.2 Other Model Classes
Like LESS, PACK [62] considers binary data symmetrically. Its patterns are itemsets,
but they are modeled in a decision tree instead of a code table. This way, probabilities
can be calculated more straightforwardly and refined MDL can be used for the
encoding. MTV [44] also constructs a probabilistic model of the data, and aims to find the set of itemsets that best predicts the data. The framework allows both BIC and MDL to be used for model selection. Typically, between tens and hundreds of itemsets are selected.
STIJL [63] describes data hierarchically in terms of dense and sparse tiles, i.e., rectangles in the data that contain surprisingly many or few 1s.
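The intuition that homogeneous tiles compress well can be illustrated with a simple sketch. This is an assumption-laden simplification, not STIJL's actual encoding: encoding a tile's cells with an optimal code for the tile's own 1-density costs area times H(density) bits, which is small exactly when the tile is very dense or very sparse.

```python
import math

def tile_bits(matrix, rows, cols):
    """Bits to encode the cells of the tile `rows` x `cols` with an
    optimal code for the tile's own 1-density.

    Very dense or very sparse tiles are cheap; mixed tiles cost up
    to one bit per cell. Illustrative sketch only, not STIJL's
    published encoding.
    """
    area = len(rows) * len(cols)
    ones = sum(matrix[r][c] for r in rows for c in cols)
    p = ones / area
    if p in (0.0, 1.0):  # perfectly homogeneous: nothing left to encode
        return 0.0
    # binary entropy of the tile's 1-density, in bits per cell
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return area * entropy
```

An all-ones 2-by-2 tile costs 0 bits, while a checkerboard 2-by-2 tile (density 0.5) costs the full 4 bits, matching the intuition that only homogeneous tiles help compression.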
We also find compression-based models in the literature that employ lossy com-
pression. While this contradicts MDL in principle, as long as the amount of 'lost'
data is not too large, relatively fair comparisons between models can still be made.