and how these models can be used for tasks other than describing and summarizing
the data.
In the following we show how many learning and mining tasks can be naturally formalized in terms of compression, using the pattern-based models and MDL formulation described in this chapter. In particular, to give more concrete details, we focus on using code tables as models. Again, it is important to note that the overall approach can be applied to other compression- and pattern-based models as well.
5.1 Classification
Classification is a traditional task in machine learning and data mining. Informally, it can be summarized as follows: given a training set of tuples with class labels and an 'unseen' tuple t without class label, use the training data to infer the correct class label for t. Next, we describe a simple classification scheme based on the MDL principle [37].
5.1.1 Classification through MDL
If we assume that a database D is an i.i.d. sample from some underlying data distribution 𝒟, we expect the optimal model for this database, i.e., optimal in the MDL sense, to compress an arbitrary tuple sampled from this distribution well. For this to work, we need a model that supports tuple-level compression.
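To make tuple-level compression concrete, here is a minimal Python sketch. It assumes a code table represented as (pattern, probability) pairs in cover order, and a simple greedy cover function; the representation and the helper names (greedy_cover, encoded_length) are illustrative, not the chapter's formal definitions.

```python
import math

def greedy_cover(code_table, t):
    """Cover tuple t (a set of items) greedily with the code table's patterns."""
    remaining, cover = set(t), []
    for pattern, prob in code_table:
        if pattern <= remaining:      # pattern fits the still-uncovered items
            cover.append((pattern, prob))
            remaining -= pattern
        if not remaining:
            break
    assert not remaining, "a code table contains all singletons, so t is coverable"
    return cover

def encoded_length(code_table, t):
    """L(t | CT): the sum of the optimal code lengths -log P(X) over cover(CT, t)."""
    return sum(-math.log2(prob) for _, prob in greedy_cover(code_table, t))

# Illustrative code table over items {a, b, c}; the probabilities stand in
# for the patterns' relative usage in the cover of the database.
CT = [(frozenset('ab'), 0.5), (frozenset('a'), 0.2),
      (frozenset('b'), 0.1), (frozenset('c'), 0.2)]
print(encoded_length(CT, {'a', 'b', 'c'}))  # -log2(0.5) - log2(0.2) ≈ 3.32 bits
```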
In the context of code tables, we make this intuition more formal in Lemma 8.13. We say that the patterns in CT are independent if any co-occurrence of two patterns X, Y ∈ CT in the cover of a tuple is independent. That is, P(XY) = P(X)P(Y), a Naïve Bayes [71] like assumption.
Lemma 8.13 Let D be a bag of tuples drawn from 𝒟, cover a cover function, CT the optimal code table for D, and t an arbitrary transaction from 𝒟. Then, if the patterns X ∈ cover(CT, t) are independent,

L(t | CT) = −log(P(t | 𝒟, CT)).

(See [37] for the proof.)
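The equality in Lemma 8.13 can be checked numerically: under the independence assumption, P(t | 𝒟, CT) factorizes into the product of the probabilities of the patterns covering t, so the summed code lengths equal the negative log of that product. A small check, reusing greedy_cover, encoded_length, and the illustrative CT from the sketch above:

```python
import math
from functools import reduce

cover = greedy_cover(CT, {'a', 'b', 'c'})   # [({'a','b'}, 0.5), ({'c'}, 0.2)]
p_t = reduce(lambda acc, p: acc * p, (prob for _, prob in cover), 1.0)

lhs = encoded_length(CT, {'a', 'b', 'c'})   # L(t | CT)
rhs = -math.log2(p_t)                       # -log P(t | 𝒟, CT) under independence
assert abs(lhs - rhs) < 1e-9                # Lemma 8.13, numerically
```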
This lemma is only valid under the Naïve Bayes like assumption, which in theory might be violated. However, by MDL, if there were patterns X, Y ∈ CT such that P(XY) > P(X)P(Y), there will be a pattern Z in the optimal code table CT that covers both X and Y, since exploiting such a correlation yields shorter codes and hence better compression.
Now, assume that we have two databases generated from two different underly-
ing distributions, with corresponding optimal code tables. For a new tuple that is
generated under one of the two distributions, we can now decide to which distribu-
tion it most likely belongs. That is, under the Naïve Bayes assumption, we have the
following lemma.
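In code, this decision rule amounts to compressing t with the code table of each class and predicting the class whose code table yields the shortest encoding, which by Lemma 8.13 is the class under which t is most likely. A sketch, again reusing encoded_length from above; the per-class code tables below are made up for illustration, whereas in practice each would be induced from the training tuples of its class.

```python
def classify(code_tables, t):
    """Predict the label whose code table compresses t best, i.e. minimizes L(t | CT)."""
    return min(code_tables, key=lambda label: encoded_length(code_tables[label], t))

# Two illustrative per-class code tables over items {a, b, c}:
CT_pos = [(frozenset('ab'), 0.6), (frozenset('a'), 0.1),
          (frozenset('b'), 0.1), (frozenset('c'), 0.2)]
CT_neg = [(frozenset('bc'), 0.6), (frozenset('a'), 0.2),
          (frozenset('b'), 0.1), (frozenset('c'), 0.1)]
print(classify({'pos': CT_pos, 'neg': CT_neg}, {'a', 'b'}))  # -> 'pos'
```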