What is likely to be a good addition to the model? A pattern that helps to reduce
redundancy in the encoding. In our setting, this means correlations between code
usages. If the code for pattern A and the code for pattern B often co-occur, we can gain
bits by introducing a new code C meaning 'A and B'. We can hence find good candidates by
mining frequent patterns in 'encoding space'. Moreover, by employing an optimistic
estimate we can prune large parts of the search space and efficiently identify the
best pattern [59]. In general terms, we have
1. Start with an 'empty' model M.
2. Find that F ∈ 𝓕 that minimizes L(D, M ⊕ F).
3. Add F to M.
4. Repeat steps 2–3 until compression can no longer be improved.
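To make this loop concrete, here is a minimal Python sketch of the generic greedy search. The names candidates, extend, and encoded_length are placeholders for whatever the concrete encoding and pattern type supply; they are not part of any published implementation.

```python
def greedy_mdl_search(data, empty_model, candidates, extend, encoded_length):
    """Generic greedy MDL search (steps 1-4 above).

    All dependencies are passed in, since they depend on the encoding:
      candidates(model)     -> iterable of candidate patterns F
      extend(model, f)      -> the model M extended with F
      encoded_length(d, m)  -> total encoded size L(D, M) in bits
    """
    model = empty_model                          # step 1
    best_len = encoded_length(data, model)
    while True:
        best_f, best_f_len = None, best_len
        for f in candidates(model):              # step 2: best candidate F
            f_len = encoded_length(data, extend(model, f))
            if f_len < best_f_len:
                best_f, best_f_len = f, f_len
        if best_f is None:                       # step 4: no gain, so stop
            return model
        model = extend(model, best_f)            # step 3: add F to M
        best_len = best_f_len
```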
Because of the strong dependence on the specific encoding and pattern type, pro-
viding a universal algorithmic strategy for step 2 is hardly possible: in itemset data,
correlations mean co-occurrences [59]; in sequential data, they mean close-by occur-
rences [64]; and so on. In general, the current encoding of the data has to be inspected
to see whether there are any 'patterns' in it that can be exploited to improve compression.
The SLIM algorithm [59] was the first to implement this strategy for MDL, and
induces KRIMP code tables by iteratively searching for pairs of itemsets that often
occur together. The union of the pair that results in the best improvement in com-
pression is added to the code table. Although it hence considers a search space of
only O(|CT|²) instead of O(|𝓕|²), its results very closely approximate those of the ideal
local greedy strategy, or, KRAMP. In particular for dense data, SLIM can be orders
of magnitude faster than KRIMP, obtaining smaller code tables that offer much more
succinct descriptions of the data.
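The pairwise search can be sketched as follows. The optimistic estimate below, the minimum of the two usage counts as an upper bound on how often the codes can co-occur, is a simplified stand-in for SLIM's actual gain estimate, and all names are illustrative only.

```python
from itertools import combinations

def pairwise_candidates(code_table, usage):
    """Rank unions X | Y of code table elements by an optimistic gain
    estimate. usage[x] is how often x's code is used in the current
    encoding; min(usage[x], usage[y]) upper-bounds how often the two
    codes can co-occur, and hence bounds the achievable gain, so
    low-estimate pairs can be pruned without being evaluated."""
    scored = []
    for x, y in combinations(code_table, 2):
        estimate = min(usage[x], usage[y])
        if estimate > 0:
            scored.append((estimate, x | y))
    scored.sort(key=lambda pair: -pair[0])   # best estimate first
    return [union for _, union in scored]

# Example: itemsets as frozensets with their usage counts.
ct = [frozenset({1, 2}), frozenset({3}), frozenset({4, 5})]
use = {ct[0]: 80, ct[1]: 75, ct[2]: 3}
print(pairwise_candidates(ct, use)[0])   # frozenset({1, 2, 3}) ranked first
```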
To save computation, SQS does not iteratively identify the single best candidate.
Instead, it iteratively generates a set of candidates given the current model, considers
all these candidates in turn, then generates new candidates, and so on, until MDL tells it to
stop.
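In outline, that batch strategy looks like this. Here generate_candidates and try_add are assumed helpers standing in for SQS's candidate generation and MDL acceptance test; they are not its actual API.

```python
def batch_mdl_search(data, model, generate_candidates, try_add):
    """Batch strategy: generate a set of candidates from the current
    model, test each in turn, then regenerate from the updated model;
    stop once a full batch yields no accepted candidate."""
    while True:
        accepted_any = False
        for cand in generate_candidates(model, data):
            new_model, accepted = try_add(model, cand, data)
            if accepted:                 # MDL: compression improved
                model = new_model
                accepted_any = True
        if not accepted_any:             # MDL tells us to stop
            return model
```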
5 MDL for Data Mining
So far we have considered compression for model selection, but it has been argued [15]
and shown in the literature that compression can also be used for many other (data mining)
tasks. For example, we already referred to the Normalized Compression Distance [41]. Another
concrete example is the use of MPEG video compression for image clustering [29].
In these examples, existing compression algorithms are used as 'black boxes' to
approximate Kolmogorov complexity, and usually only dataset-level compression is
required (to be precise, individual strings/objects are treated as 'datasets').
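As a concrete illustration of this black-box use of compressors, the Normalized Compression Distance can be computed with any off-the-shelf compressor; the sketch below uses Python's zlib. Since any real compressor only approximates Kolmogorov complexity, the result is an approximation of the ideal distance.

```python
import zlib

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance with zlib as the compressor:
    NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
    where C(.) is the compressed size in bytes. Because zlib only
    approximates Kolmogorov complexity, values can fall slightly
    outside the ideal [0, 1] range."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

# Objects that share structure compress well together -> smaller distance.
a, b, c = b"abcabcabc" * 30, b"abcabc" * 45, bytes(range(256)) * 2
print(ncd(a, b) < ncd(a, c))   # True: a is closer to b than to c
```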
In this chapter, we are particularly interested in compression-based models that
allow for inspection, so that any discovered local structure can be interpreted by
domain experts. For that purpose, pattern-based models that can be selected by means
of the MDL principle have been developed. However, we have not yet discussed if