cover function presented in Algorithm 4. As we will see in the next section, there
also exist more sophisticated algorithms for inducing code tables.
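As a rough illustration of such a cover function, the greedy strategy can be sketched as follows. This is a minimal sketch, not KRIMP's actual implementation; the representation of a code table as an ordered list of frozensets, and the fallback to singleton codes, are assumptions for illustration:

```python
def cover(transaction, code_table):
    """Greedily cover a transaction with code table elements.

    `code_table` is assumed to be a list of frozensets, already
    sorted in cover order (longer, more frequent itemsets first).
    Returns the list of itemsets used; items not covered by any
    non-singleton element fall back to their singleton codes.
    Illustrative sketch only, not the published algorithm.
    """
    remaining = set(transaction)
    used = []
    for itemset in code_table:
        if itemset <= remaining:  # itemset fits in the uncovered part
            used.append(itemset)
            remaining -= itemset
    # uncovered items are encoded with their singleton codes
    used.extend(frozenset([item]) for item in remaining)
    return used
```

For example, covering the transaction {1, 2, 3, 4} with a code table containing {1, 2, 3} and {2, 4} uses {1, 2, 3} first, after which {2, 4} no longer fits and item 4 falls back to its singleton code.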
In practice, we find that KRIMP returns pattern sets in the order of hundreds to a few thousand patterns [68], which have been shown to describe the distribution of the data very well. In the next section we will discuss some of the applications in which these pattern sets have been successfully put to use.
Akoglu et al. [3] proposed the COMPREX algorithm, which describes a categorical dataset by a set of KRIMP code tables: it partitions the attributes into parts that correlate strongly, and induces a KRIMP code table for each part directly from the data.
In frequent pattern mining, and hence in KRIMP, only associations between 1s of the data are regarded as potentially interesting. This is mostly a matter of keeping the computation feasible; clearly there are cases where associations between occurrences and absences are also interesting. LESS [24] is an algorithm that describes data not with frequent itemsets but with low-entropy sets [23]: attribute sets for which the distribution of the occurrences is strongly skewed. LESS code tables consist of low-entropy sets, and the algorithm uses these to identify areas of the data where the attributes strongly interact. LESS code tables typically contain only tens to hundreds of low-entropy sets. Attribute clustering [43] provides even more succinct code tables: with the goal of providing good high-level summaries of categorical data, only up to tens of patterns are selected.
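To make "strongly skewed" concrete: the entropy of an attribute set is the entropy of the empirical distribution over its value combinations, and low-entropy sets are those scoring well below the maximum. A minimal sketch, assuming data is given as a list of equal-length 0/1 tuples:

```python
import math
from collections import Counter

def set_entropy(rows, attrs):
    """Empirical entropy, in bits, of the joint distribution over
    the value combinations of the attributes in `attrs`.

    `rows` is a list of equal-length 0/1 tuples (an assumed
    representation). A low value means the distribution of
    occurrences is strongly skewed, i.e., a low-entropy set.
    """
    counts = Counter(tuple(row[a] for a in attrs) for row in rows)
    n = len(rows)
    # Shannon entropy of the empirical value-combination distribution
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

For instance, for the rows (1,1), (1,1), (1,1), (0,0) the attribute pair {0, 1} has entropy of about 0.81 bits, far below the 2-bit maximum for two binary attributes, making it a candidate low-entropy set.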
Code table instances for richer data include those for sequential patterns, i.e., serial episodes. Bathoorn et al. [5] gave a variant of KRIMP for sequential patterns without
gaps, whereas the SQS [64] and GOKRIMP [34] algorithms provide fast approaches for descriptions in terms of serial episodes where gaps are allowed. Like KRIMP, these
algorithms find final selections in the order of hundreds of patterns.
Koopman and Siebes [32, 33] discussed the KRIMP framework in light of frequent patterns over multi-relational databases.
3.3.2 Other Model Classes
Like LESS, PACK [62] considers binary data symmetrically. Its patterns are itemsets,
but they are modeled in a decision tree instead of a code table. This way, probabilities
can be calculated more straightforwardly and refined MDL can be used for the
encoding. MTV [44] also constructs a probabilistic model of the data, and aims to find the set of itemsets that best predicts the data. The framework allows both BIC and MDL to be used for model selection. Typically, between tens and hundreds of itemsets are selected.
STIJL [63] describes data hierarchically in terms of dense and sparse tiles, i.e., rectangles in the data that contain surprisingly many or few 1s.
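The intuition that homogeneous tiles compress well can be illustrated with a simple sketch. This is an assumption-laden simplification, not STIJL's actual encoding: encoding a tile's cells with an optimal code for the tile's own 1-density costs area times H(density) bits, which is small exactly when the tile is very dense or very sparse.

```python
import math

def tile_bits(matrix, rows, cols):
    """Bits to encode the cells of the tile `rows` x `cols` with an
    optimal code for the tile's own 1-density.

    Very dense or very sparse tiles are cheap; mixed tiles cost up
    to one bit per cell. Illustrative sketch only, not STIJL's
    published encoding.
    """
    area = len(rows) * len(cols)
    ones = sum(matrix[r][c] for r in rows for c in cols)
    p = ones / area
    if p in (0.0, 1.0):  # perfectly homogeneous: nothing left to encode
        return 0.0
    # binary entropy of the tile's 1-density, in bits per cell
    entropy = -(p * math.log2(p) + (1 - p) * math.log2(1 - p))
    return area * entropy
```

An all-ones 2-by-2 tile costs 0 bits, while a checkerboard 2-by-2 tile (density 0.5) costs the full 4 bits, matching the intuition that only homogeneous tiles help compression.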
We also find compression-based models in the literature that employ lossy com-
pression. While this contradicts MDL in principle, as long as the amount of 'lost'
data is not too large, relatively fair comparisons between models can still be made.