not any insight into the characteristics per cluster. For example, what is typical for that cluster, and how do the different ingredients of the mixture compare to each other?
The pattern- and compression-based models described in this chapter provide all prerequisites required for data characterization, classification, and difference measurement. If a compression-based approach can be used to identify the components of a database, each represented by a pattern model, all these advantages can be obtained 'for free'.
5.3.1 MDL for Component Identification
On a high level, the goal is to discover an optimal partitioning of the database; optimal,
in the sense that the characteristics of the different components are different, while
the individual components are homogeneous. Translating this to MDL, the task is to
partition a given database such that the total compressed size of the components is
minimized—where each component is compressed by its own MDL-optimal model.
The intuition is that similar tuples of a database can be better compressed if
they are assigned to the same partition and hence compressed by the same model.
However, having multiple components, with corresponding models, allows models
to be more specific and hence can be expected to provide better overall compression.
By minimizing the total compressed size, including the sizes of the models, the
different distributions of the mixture are expected to be divided over the partitions.
Following [39], we have the following problem statement:

Problem 5.1 (Identifying Database Components) Let $D$ be a bag of tuples drawn from $T$. Find a partitioning $D_1, \ldots, D_k$ of $D$ and associated models $M_1, \ldots, M_k$, such that the total compressed size of $D$, $\sum_{i \in \{1, \ldots, k\}} L(M_i, D_i)$, is minimized.
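Spelling this out, and assuming the two-part MDL score used elsewhere in this chapter, each term charges for both the model and the data encoded with it:

$$L(M_i, D_i) = L(M_i) + L(D_i \mid M_i),$$

so every additional component must pay for its own model before it can pay off in better data compression.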
There are a few observations we should make with regard to this problem.
First of all, note that it is parameter-free: MDL determines the optimal number of
components. Second, asking for both the partitioning and the models is in a sense
redundant. For any partitioning, the best associated models are, of course, the optimal
ones. The other way around, given a set of models, a database partitions naturally:
each tuple goes to the model that compresses it best, as with classification.
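To make this second observation concrete, here is a minimal Python sketch of the assignment direction; encoded_length is a hypothetical helper returning the number of bits a model needs to encode a tuple (e.g., its code length under a Krimp-style code table):

```python
def assign_tuples(database, models):
    """Partition a database given fixed models: each tuple goes to
    the model that compresses it best, exactly as in classification."""
    partitions = [[] for _ in models]
    for t in database:
        # encoded_length(model, t) is hypothetical: bits model needs for t
        best = min(range(len(models)),
                   key=lambda i: encoded_length(models[i], t))
        partitions[best].append(t)
    return partitions
```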
The search space, however, is enormous, and solving the problem exactly is hard. An effective and efficient heuristic is to take an EM-like approach [14]: start with a random partitioning, then iteratively induce models and re-assign tuples to maximize compression, until convergence. Besides automatically determining the optimal number of components, this approach has been shown to find sound groupings [39].
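The following is a minimal sketch of this EM-like loop, building on assign_tuples above; induce_model (inducing an MDL-optimal model for one partition, e.g., via Krimp) and total_size (summing the sizes L(M_i, D_i) over all components) are hypothetical helpers. For brevity the sketch fixes the number of components k, whereas in full the optimal k is selected by MDL itself, e.g., by comparing total compressed sizes across different k:

```python
import random

def identify_components(database, k, max_iters=100):
    """EM-like heuristic: start from a random k-way partitioning and
    alternate model induction and tuple re-assignment until the
    total compressed size stops improving."""
    partitions = [[] for _ in range(k)]
    for t in database:                  # random initial partitioning
        partitions[random.randrange(k)].append(t)

    prev_size = float('inf')
    for _ in range(max_iters):
        # induce_model and total_size are hypothetical helpers (see text)
        models = [induce_model(p) for p in partitions]
        partitions = assign_tuples(database, models)
        size = total_size(models, partitions)
        if size >= prev_size:           # converged: no improvement in bits
            break
        prev_size = size
    return partitions, models
```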