description out of all models in the model class, and it is that model that we are interested in—in the end, the length of the description is often not of much interest. When
compression is the goal, a general purpose compressor such as ZIP often provides
much better compression, as it can exploit many types of statistical dependencies.
In a similar vein, it is also important to note that in MDL we are not concerned with
materialized codes, but only interested in their lengths—again, as model selection
is the goal. Although a complete overview of all useful codes—for which we can
compute the optimal lengths in practice—is beyond the scope of this chapter, we
will discuss a few instances in the next chapter, where we show how to use
MDL for pattern mining. Before we do so, however, let us quickly go into the general
applicability of MDL in data mining.
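To make the point about lengths versus materialized codes concrete, here is a minimal Python sketch (not from the chapter) that computes Shannon-optimal code lengths, L(x) = -log2 P(x), directly from symbol frequencies. No actual code words are ever constructed; the lengths alone suffice for comparing models. The symbol names and counts are hypothetical.

```python
import math

def code_lengths(counts):
    """Shannon-optimal code lengths (in bits) for the distribution
    induced by raw symbol counts; no code is materialized."""
    total = sum(counts.values())
    return {x: -math.log2(c / total) for x, c in counts.items()}

# Hypothetical symbol frequencies
counts = {"a": 8, "b": 4, "c": 2, "d": 2}
lengths = code_lengths(counts)
# "a" occurs half the time, so L(a) = -log2(8/16) = 1 bit
```

Summing `lengths[x]` over the symbols of a dataset gives the total encoded size under this model, which is all that model selection needs.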
2.3 MDL in Data Mining
Faloutsos and Megalooikonomou [15] argue that Kolmogorov Complexity and
Minimum Description Length [20, 52] provide a powerful and well-founded approach
to data mining. There exist many examples where MDL has been successfully
employed in data mining, including, for example, classification [37, 50], clustering
[6, 31, 39], discretization [16, 30], defining parameter-free distance measures
[11, 28, 29, 66], feature selection [48], imputation [65], mining temporally surprising
patterns [8], detecting change points in data streams [36], model order selection in
matrix factorization [45], outlier detection [3, 58], summarizing categorical data [43],
transfer learning [54], discovering communities in matrices [9, 47, 63] and evolving
graphs [60], finding sources of infection in large graphs [49], and making sense
of selected nodes in graphs [4].
We will discuss a few of these instances in Sect. 5, but first cover how to define
an MDL score for a pattern-based model.
3 Compression-based Pattern Models
In this section we introduce how to use the above foundations for mining small sets
of patterns that capture the data distribution well. We will give both the high-level
picture and illustrate with concrete instances and examples. Before we go into details,
let us briefly describe the basic ingredients that are required for any pattern-based
model.
We assume that a dataset D is a bag of elements t of some data type—which
we, for simplicity, will refer to as tuples. In the context of frequent itemset mining,
each t is a transaction over a set of items I, i.e., t ⊆ I. Similarly, we can consider
sequences, trees, graphs, time series, etc. Let us write T to denote the universe of
possible tuples for a given data type. Clearly, all tuples in D are elements from T.
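As a minimal Python illustration of this setup (the item names are hypothetical and not from the chapter), a transactional dataset D can be represented as a list of sets over an item universe I, where the list allows duplicate transactions since D is a bag:

```python
# Item universe I for an itemset-mining setting (hypothetical items)
I = {"bread", "milk", "beer", "diapers"}

# Dataset D: a bag of transactions, each a subset of I.
# A list (not a set) is used so duplicate tuples are allowed.
D = [
    {"bread", "milk"},
    {"beer", "diapers"},
    {"bread", "milk", "beer"},
    {"bread", "milk"},  # duplicate transaction: D is a bag
]

# Every tuple t in D is an element of the universe T = 2^I,
# i.e., each transaction is a subset of the item universe.
assert all(t <= I for t in D)
```

For sequences, trees, or graphs the universe T changes accordingly, but the same picture holds: D is a bag of tuples drawn from T.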