not any insight into the characteristics per cluster. For example, what is typical for that cluster, and how do the different ingredients of the mixture compare to each other?
The pattern- and compression-based models described in this chapter provide all prerequisites required for data characterization, classification, and difference measurement. If a compression-based approach can be used to identify the components of a database, each represented by a pattern model, all these advantages can be obtained 'for free'.
5.3.1 MDL for Component Identification
On a high level, the goal is to discover an optimal partitioning of the database; optimal,
in the sense that the characteristics of the different components are different, while
the individual components are homogeneous. Translating this to MDL, the task is to
partition a given database such that the total compressed size of the components is
minimized—where each component is compressed by its own MDL-optimal model.
The intuition is that similar tuples of a database can be better compressed if
they are assigned to the same partition and hence compressed by the same model.
However, having multiple components, with corresponding models, allows models
to be more specific and hence can be expected to provide better overall compression.
By minimizing the total compressed size, including the sizes of the models, the
different distributions of the mixture are expected to be divided over the partitions.
Following [39], we have the following problem statement:

Problem 5.1 (Identifying Database Components) Let $D$ be a bag of tuples drawn from $T$. Find a partitioning $D_1, \ldots, D_k$ of $D$ and associated models $M_1, \ldots, M_k$, such that the total compressed size of $D$, $\sum_{i \in \{1, \ldots, k\}} L(M_i, D_i)$, is minimized.
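Spelling this out, and assuming the two-part MDL score used elsewhere in this chapter, each term charges for both the model and the data encoded with it:

$$L(M_i, D_i) = L(M_i) + L(D_i \mid M_i),$$

so every additional component must pay for its own model before it can pay off in better data compression.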
There are a few observations we should make with regard to this problem.
First of all, note that it is parameter-free: MDL determines the optimal number of
components. Second, asking for both the partitioning and the models is in a sense
redundant. For any partitioning, the best associated models are, of course, the optimal
ones. The other way around, given a set of models, a database partitions naturally:
each tuple goes to the model that compresses it best, as with classification.
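To make this second observation concrete, here is a minimal Python sketch of the assignment direction; encoded_length is a hypothetical helper returning the number of bits a model needs to encode a tuple (e.g., its code length under a Krimp-style code table):

```python
def assign_tuples(database, models):
    """Partition a database given fixed models: each tuple goes to
    the model that compresses it best, exactly as in classification."""
    partitions = [[] for _ in models]
    for t in database:
        # encoded_length(model, t) is hypothetical: bits model needs for t
        best = min(range(len(models)),
                   key=lambda i: encoded_length(models[i], t))
        partitions[best].append(t)
    return partitions
```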
The search space, however, is enormous, and solving the problem exactly is hard. An effective and efficient heuristic is to take an EM-like approach [14]: start with a random partitioning, then iteratively induce models and re-assign tuples to maximize compression, until convergence. Besides automatically determining the optimal number of components, this approach has been shown to find sound groupings [39].
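The following is a minimal sketch of this EM-like loop, building on assign_tuples above; induce_model (inducing an MDL-optimal model for one partition, e.g., via Krimp) and total_size (summing the sizes L(M_i, D_i) over all components) are hypothetical helpers. For brevity the sketch fixes the number of components k, whereas in full the optimal k is selected by MDL itself, e.g., by comparing total compressed sizes across different k:

```python
import random

def identify_components(database, k, max_iters=100):
    """EM-like heuristic: start from a random k-way partitioning and
    alternate model induction and tuple re-assignment until the
    total compressed size stops improving."""
    partitions = [[] for _ in range(k)]
    for t in database:                  # random initial partitioning
        partitions[random.randrange(k)].append(t)

    prev_size = float('inf')
    for _ in range(max_iters):
        # induce_model and total_size are hypothetical helpers (see text)
        models = [induce_model(p) for p in partitions]
        partitions = assign_tuples(database, models)
        size = total_size(models, partitions)
        if size >= prev_size:           # converged: no improvement in bits
            break
        prev_size = size
    return partitions, models
```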