database, L(D_1 | M_1), and arrive at

$$\mathrm{ACLD}(D_1, M_2) = \frac{L(D_1 \mid M_2) - L(D_1 \mid M_1)}{L(D_1 \mid M_1)}.$$
Like Kullback-Leibler divergence, ACLD is asymmetric: it measures how different D_2 is from D_1, not vice versa. While it is reasonable to expect these two quantities to be in the same ballpark, this is not a given.
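As a minimal sketch (not from the chapter), the aggregated code length difference can be computed directly from the encoded sizes. The function name and all numeric values below are illustrative assumptions; in practice L(D | M) would come from a compressor such as KRIMP.

```python
def acld(l_d1_m2: float, l_d1_m1: float) -> float:
    """Aggregated code length difference ACLD(D1, M2):
    how many more bits M2 (the model of D2) needs to encode D1,
    relative to D1's own MDL-optimal model M1."""
    return (l_d1_m2 - l_d1_m1) / l_d1_m1

# Placeholder sizes in bits: M2 needs 1300 bits where M1 needs 1000,
# so D1 is 30% more expensive to encode under D2's model.
print(acld(1300.0, 1000.0))  # → 0.3
```

Note the asymmetry: swapping which database's model is used generally changes the score, which is exactly why a symmetric variant is introduced next.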
5.2.2 The Database Dissimilarity Measure
The asymmetric measure allows us to measure the similarity of one database to another. To make it a practical measure, we would like it to be symmetric. We achieve this by taking the maximum of the two aggregated differences, i.e., max{ACLD(D_1, M_2), ACLD(D_2, M_1)}. This can easily be rewritten in terms of compressed database sizes, as follows.
Definition 8.15 Let D_1 and D_2 be two databases drawn from T, and let M_1 and M_2 be their corresponding MDL-optimal models. Then, the dissimilarity measure DS between D_1 and D_2 is defined as

$$\mathrm{DS}(D_1, D_2) = \max\left\{\frac{L(D_1 \mid M_2) - L(D_1 \mid M_1)}{L(D_1 \mid M_1)},\ \frac{L(D_2 \mid M_1) - L(D_2 \mid M_2)}{L(D_2 \mid M_2)}\right\}.$$
Using this measure, we obtain a score of 0 iff the databases are identical, and higher scores indicate higher dissimilarity. In theory, using MDL-optimal models we find that DS, like NCD [11], is a metric: the symmetry axiom holds by definition, scores cannot be negative, and it holds that DS(D_1, D_2) = 0 iff D_1 = D_2. The advantage of DS over NCD is that we only have to induce two models, as opposed to four. For heuristic model induction algorithms the metric property is difficult to prove. However, instantiating this measure for itemset data using KRIMP, we obtain very good results [66]: dataset pairs drawn from the same distribution have very low dissimilarities, whereas dataset pairs from different distributions have substantially larger dissimilarities.
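The definition above reduces to a few arithmetic operations once the four encoded sizes are known. The following sketch assumes those sizes are given (the function name and the placeholder bit counts are illustrative, not from the chapter):

```python
def dissimilarity(l11: float, l12: float, l21: float, l22: float) -> float:
    """DS(D1, D2): the maximum of the two normalized aggregated
    code length differences. Here l_ij = L(D_i | M_j), the number
    of bits needed to encode database D_i with model M_j."""
    return max((l12 - l11) / l11, (l21 - l22) / l22)

# Identical databases: each model compresses the other database
# exactly as well as its own, so the dissimilarity is 0.
print(dissimilarity(1000.0, 1000.0, 1000.0, 1000.0))  # → 0.0

# Different databases: cross-compression costs extra bits, and DS
# keeps the larger of the two relative overheads (here 0.4 vs ~0.389).
print(dissimilarity(1000.0, 1400.0, 1250.0, 900.0))  # → 0.4
```

Symmetry holds by construction (the max runs over both directions), and only the two models M_1 and M_2 need to be induced, matching the comparison with NCD above.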
5.3 Identifying and Characterizing Components
Though most databases are mixtures drawn from different distributions, we often assume only one distribution. Clearly, this leads to suboptimal results: the distributions need to be modeled individually.
Clustering addresses part of this problem by trying to separate the source components that make up the mixture. However, as we do not know upfront what distinguishes the different components, the appropriate distance metric is hard to define. Furthermore, in clustering we are only returned the object assignment, and