and how these models can be used for tasks other than describing and summarizing
the data.
In the following we show how many learning and mining tasks can be naturally formalized in terms of compression, using the pattern-based models and MDL formulation described in this chapter. In particular, to give more concrete details, we focus on using code tables as models. Again, it is important to note that the overall approach can be applied to other compression- and pattern-based models as well.
5.1 Classification
Classification is a traditional task in machine learning and data mining. Informally, it can be summarized as follows: given a training set of tuples with class labels and an 'unseen' tuple t without class label, use the training data to infer the correct class label for t. Next, we describe a simple classification scheme based on the MDL principle [37].
5.1.1 Classification through MDL
If we assume that a database D is an i.i.d. sample from some underlying data distribution 𝒟, we expect the optimal model for this database, i.e., optimal in the MDL sense, to compress an arbitrary tuple sampled from this distribution well. For this to work, we need a model that supports tuple-level compression.
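To make tuple-level compression concrete, here is a minimal Python sketch. It assumes a code table represented as (pattern, probability) pairs in cover order, and a simple greedy cover function; the representation and the helper names (greedy_cover, encoded_length) are illustrative, not the chapter's formal definitions.

```python
import math

def greedy_cover(code_table, t):
    """Cover tuple t (a set of items) greedily with the code table's patterns."""
    remaining, cover = set(t), []
    for pattern, prob in code_table:
        if pattern <= remaining:      # pattern fits the still-uncovered items
            cover.append((pattern, prob))
            remaining -= pattern
        if not remaining:
            break
    assert not remaining, "a code table contains all singletons, so t is coverable"
    return cover

def encoded_length(code_table, t):
    """L(t | CT): the sum of the optimal code lengths -log P(X) over cover(CT, t)."""
    return sum(-math.log2(prob) for _, prob in greedy_cover(code_table, t))

# Illustrative code table over items {a, b, c}; the probabilities stand in
# for the patterns' relative usage in the cover of the database.
CT = [(frozenset('ab'), 0.5), (frozenset('a'), 0.2),
      (frozenset('b'), 0.1), (frozenset('c'), 0.2)]
print(encoded_length(CT, {'a', 'b', 'c'}))  # -log2(0.5) - log2(0.2) ≈ 3.32 bits
```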
In the context of code tables, we make this intuition more formal in Lemma 8.13. We say that the patterns in CT are independent if any co-occurrence of two patterns X, Y ∈ CT in the cover of a tuple is independent. That is, P(XY) = P(X)P(Y), a Naïve Bayes [71] like assumption.
Lemma 8.13 Let D be a bag of tuples drawn from 𝒟, cover a cover function, CT the optimal code table for D, and t an arbitrary transaction from 𝒟. Then, if the patterns X ∈ cover(CT, t) are independent,

L(t | CT) = −log(P(t | 𝒟, CT)).

(See [37] for the proof.)
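The equality in Lemma 8.13 can be checked numerically: under the independence assumption, P(t | 𝒟, CT) factorizes into the product of the probabilities of the patterns covering t, so the summed code lengths equal the negative log of that product. A small check, reusing greedy_cover, encoded_length, and the illustrative CT from the sketch above:

```python
import math
from functools import reduce

cover = greedy_cover(CT, {'a', 'b', 'c'})   # [({'a','b'}, 0.5), ({'c'}, 0.2)]
p_t = reduce(lambda acc, p: acc * p, (prob for _, prob in cover), 1.0)

lhs = encoded_length(CT, {'a', 'b', 'c'})   # L(t | CT)
rhs = -math.log2(p_t)                       # -log P(t | 𝒟, CT) under independence
assert abs(lhs - rhs) < 1e-9                # Lemma 8.13, numerically
```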
This lemma is only valid under the Naïve Bayes like assumption, which in theory might be violated. However, by MDL, if there were patterns X, Y ∈ CT such that P(XY) > P(X)P(Y), there will be a pattern Z in the optimal code table CT that covers both X and Y, since exploiting such a correlation yields shorter codes and hence better compression.
Now, assume that we have two databases generated from two different underly-
ing distributions, with corresponding optimal code tables. For a new tuple that is
generated under one of the two distributions, we can now decide to which distribu-
tion it most likely belongs. That is, under the Naïve Bayes assumption, we have the
following lemma.
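In code, this decision rule amounts to compressing t with the code table of each class and predicting the class whose code table yields the shortest encoding, which by Lemma 8.13 is the class under which t is most likely. A sketch, again reusing encoded_length from above; the per-class code tables below are made up for illustration, whereas in practice each would be induced from the training tuples of its class.

```python
def classify(code_tables, t):
    """Predict the label whose code table compresses t best, i.e. minimizes L(t | CT)."""
    return min(code_tables, key=lambda label: encoded_length(code_tables[label], t))

# Two illustrative per-class code tables over items {a, b, c}:
CT_pos = [(frozenset('ab'), 0.6), (frozenset('a'), 0.1),
          (frozenset('b'), 0.1), (frozenset('c'), 0.2)]
CT_neg = [(frozenset('bc'), 0.6), (frozenset('a'), 0.2),
          (frozenset('b'), 0.1), (frozenset('c'), 0.1)]
print(classify({'pos': CT_pos, 'neg': CT_neg}, {'a', 'b'}))  # -> 'pos'
```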