Moreover, it allows us to use compression for data mining tasks, such as computing data dissimilarity. Note that this property generally holds for any generic compressor, and therefore compression algorithms like those in ZIP, GZIP, BZIP, etc., can also be used for such tasks. As a concrete example, the family of Normalized Compression Distance measures [41] relies on this.
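To make this concrete, here is a minimal sketch of the Normalized Compression Distance computed with an off-the-shelf compressor; the choice of zlib and the toy inputs are illustrative assumptions, not part of the original formulation.

```python
import zlib

def compressed_size(x: bytes) -> int:
    """Size in bytes of x under a generic, off-the-shelf compressor (zlib)."""
    return len(zlib.compress(x, 9))

def ncd(x: bytes, y: bytes) -> float:
    """Normalized Compression Distance: close to 0 for similar inputs,
    close to 1 for dissimilar ones (up to compressor imperfections)."""
    cx, cy, cxy = compressed_size(x), compressed_size(y), compressed_size(x + y)
    return (cxy - min(cx, cy)) / max(cx, cy)

# Toy usage: a near-duplicate pair should score lower than an unrelated pair.
print(ncd(b"the quick brown fox" * 10, b"the quick brown fox!" * 10))
print(ncd(b"the quick brown fox" * 10, b"lorem ipsum dolor sit amet" * 10))
```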
As a slight variant, we can also fix the model M and see how well it compresses another dataset. That is, we require that L(D | M) is explicitly calculable for any dataset D over the specified data universe T and any model M ∈ M. This allows us to calculate how well a dataset matches the distribution of another dataset. See also Sect. 5.
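As a rough illustration of this idea with a generic compressor (only a sketch; the preset-dictionary trick and the toy datasets below are assumptions for illustration, not the models discussed in this chapter), we can build a compression dictionary from one dataset, keep it fixed, and measure how many bits another dataset takes under it.

```python
import zlib

def bits_under_fixed_model(model_data: bytes, data: bytes) -> int:
    """Crude stand-in for L(D | M): compress `data` with zlib primed with a
    fixed preset dictionary built from `model_data` (zlib only uses the last
    32 KiB of the dictionary)."""
    compressor = zlib.compressobj(level=9, zdict=model_data[-32768:])
    return 8 * len(compressor.compress(data) + compressor.flush())

d_a = b"red,apple;green,pear;red,cherry;" * 50    # data the "model" is built from
d_b = b"green,pear;red,apple;red,cherry;" * 5     # similar distribution
d_c = b"blue,whale;grey,elephant;pink,boar;" * 5  # different distribution

# D_B should need fewer bits under the model of D_A than D_C does.
print(bits_under_fixed_model(d_a, d_b))
print(bits_under_fixed_model(d_a, d_c))
```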
Tuple-level Compression In addition, a rather useful property is when each tuple t ∈ D can be compressed individually, independent of all other tuples. That is, L(t | M) can be computed for a given M. This also implies we can simply calculate L(D | M) as

L(D | M) = Σ_{t ∈ D} L(t | M).
This property simplifies many aspects related to the induction and usage of the
models. For example, as a consequence, calculating the encoded size can now be
trivially parallelized. More important, though, is that common data mining tasks
such as classification and clustering are now straightforward. More on this later.
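As an illustration of how tuple-level code lengths make classification straightforward, here is a minimal sketch. The stand-in model (independent per-attribute counts with Laplace smoothing) and the toy data are assumptions for illustration only, not the pattern-based models of this chapter.

```python
import math
from collections import Counter

def induce(tuples):
    """Stand-in model M: independent per-attribute value counts
    (illustration only; the models in this chapter are pattern-based)."""
    counts = [Counter(t[i] for t in tuples) for i in range(len(tuples[0]))]
    return counts, len(tuples)

def L_tuple(t, model):
    """L(t | M) in bits: -log2 of the (Laplace-smoothed) probability of
    each attribute value of t under the model."""
    counts, n = model
    return sum(-math.log2((counts[i][v] + 1) / (n + len(counts[i]) + 1))
               for i, v in enumerate(t))

def L_data(D, model):
    """L(D | M) = sum over all tuples t in D of L(t | M);
    each term is independent, so this is trivially parallelizable."""
    return sum(L_tuple(t, model) for t in D)

def classify(t, class_models):
    """MDL-style classification: assign t to the class whose model
    compresses it best, i.e., gives the shortest code length."""
    return min(class_models, key=lambda c: L_tuple(t, class_models[c]))

# Toy usage: one model per class, induced from labelled tuples.
labelled = {
    "fruit":     [("red", "round"), ("green", "round"), ("red", "round")],
    "vegetable": [("green", "long"), ("green", "long"), ("white", "long")],
}
class_models = {c: induce(ts) for c, ts in labelled.items()}
print(classify(("red", "round"), class_models))   # -> fruit
print(classify(("green", "long"), class_models))  # -> vegetable
```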
Pattern-level Inspection The third and final property that we discuss here is that of sub-tuple, or pattern-level, inspection. That is, beyond computing L(t | M), we also want to be able to inspect how a given tuple is encoded: what structure, in the form of a set of patterns, is used to compress it?
With this property, it becomes possible to provide explanations for certain outcomes (e.g., explain why a certain tuple is compressed better by one model than by another), but also to exploit this information to improve the model (e.g., patterns that often occur together in a tuple should probably be combined). Effectively, it is this property that makes pattern-based solutions so powerful, as it ensures that in addition to decisions, we can offer explanations.
3.2 Code Tables
The conceptually simplest, as well as the most commonly used, pattern-based model for MDL is the so-called code table (see, e.g., [23, 56, 57, 59, 64, 68]). Informally, a code table is a dictionary, a translation table between patterns and codes. Each entry in the left column contains a pattern and corresponds to exactly one code word in the right column. Such a code table can be used to compress the data by replacing occurrences of patterns with their corresponding codes, and vice versa to decode an encoded dataset and reconstruct the original data. Using the MDL principle, the problem can then be formulated as finding the code table that gives the best compression.
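To give a feel for this, below is a toy code table and a simple greedy cover function; the items, patterns, code lengths, and cover order are illustrative assumptions (in practice, code lengths follow from how often each pattern is used in the cover of the data, and the exact cover strategy depends on the method).

```python
# A toy code table over market-basket items: each pattern (an itemset) is
# paired with a code length in bits. The numbers are made up; in practice
# code lengths follow from how often each pattern is used in the cover.
code_table = [
    (frozenset({"bread", "butter", "jam"}), 2.0),
    (frozenset({"bread", "butter"}),        1.5),
    (frozenset({"bread"}),                  3.0),
    (frozenset({"butter"}),                 3.5),
    (frozenset({"jam"}),                    4.0),
    (frozenset({"milk"}),                   4.0),
]

def cover(t):
    """Greedily cover tuple t with patterns in code table order (more
    specific patterns first). Returns the patterns used and the total
    encoded length L(t | CT) in bits."""
    uncovered, used, bits = set(t), [], 0.0
    for pattern, code_len in code_table:
        if pattern <= uncovered:          # pattern still fully uncovered
            used.append(pattern)
            bits += code_len
            uncovered -= pattern
    assert not uncovered, "a code table should contain all single items"
    return used, bits

patterns, length = cover({"bread", "butter", "milk"})
print(patterns)  # which patterns encode the tuple (pattern-level inspection)
print(length)    # L(t | CT) = 1.5 + 4.0 bits
```

The list of patterns returned by the cover is exactly the pattern-level inspection discussed above: it tells us which structure was used to encode the tuple.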