Naïve Bayes and SVMs. This may be considered an unexpectedly positive result, as the sole goal of each code table is to characterize and describe an individual class-based database. The fact that these code tables can in practice also be used to distinguish samples drawn from the different distributions means that they indeed capture these distributions very well.
5.2 A Dissimilarity Measure for Datasets
Comparing datasets to find and explain differences is a frequent task in many organizations. The two databases can, e.g., originate from different branches of the same organization, such as sales records from different stores of a chain, or they can be snapshots of the "same" database at different points in time. A first step towards identifying differences between datasets is to quantify how different two datasets are.
Although this may appear to be a simple task at first sight, in practice it turns out to be far from trivial in many cases. In particular, this is true for data types for which no obvious distance measures are available, such as categorical data. In this subsection we describe a compression-based difference measure for datasets (based on [66]).
5.2.1 Code Length Differences
Let $D_1$ and $D_2$ be two databases with tuples drawn from the same data universe $\mathcal{T}$.
The MDL principle implies that the optimal compressor induced from a database $D_1$ will generally provide shorter encodings for its tuples than the optimal compressor induced from another database $D_2$. This is the same principle as used by the classifier described in the previous subsection, and again we assume and exploit the tuple-level compression property.
Formally, let $M_i$ be the optimal model induced from database $D_i$, and $t$ a transaction in $D_1$. Then, the MDL principle implies that $|L(t \mid M_2) - L(t \mid M_1)|$
• is small if $t$ is equally likely under the distributions underlying $D_1$ and $D_2$;
• is large if $t$ is more likely under the distribution underlying one database than under the distribution underlying the other.
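To make these quantities concrete, the following Python sketch computes such code length differences. It is a deliberate simplification, not the actual Krimp implementation: we assume a code table is given as a dict mapping itemsets (frozensets) to their usage counts, iterated in the cover order used by the compressor, and we derive Shannon-optimal code lengths from those usages.

import math

# A minimal sketch, not the actual Krimp implementation: a code table
# is assumed to be a dict mapping itemsets (frozensets) to usage counts,
# iterated in the cover order used by the compressor.

def code_lengths(code_table):
    """Shannon-optimal code lengths from usages: L(X) = -log2(usage(X) / total)."""
    total = sum(code_table.values())
    return {x: -math.log2(u / total) for x, u in code_table.items() if u > 0}

def encoded_length(t, code_table):
    """L(t | M): summed lengths of the codes that cover tuple t,
    using a greedy cover in the table's iteration order."""
    remaining = set(t)
    bits = 0.0
    for itemset, length in code_lengths(code_table).items():
        if itemset <= remaining:
            bits += length
            remaining -= itemset
    if remaining:
        raise ValueError(f"tuple not fully covered: {remaining}")
    return bits

def length_difference(t, ct1, ct2):
    """|L(t | M2) - L(t | M1)|: small if t is equally likely under both
    underlying distributions, large otherwise."""
    return abs(encoded_length(t, ct2) - encoded_length(t, ct1))

# Example with two toy code tables over items {a, b, c}:
ct1 = {frozenset("ab"): 5, frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 1}
ct2 = {frozenset("bc"): 5, frozenset("a"): 1, frozenset("b"): 1, frozenset("c"): 1}
print(length_difference(frozenset("ab"), ct1, ct2))  # large: {a,b} is cheap under ct1 only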
Furthermore, the MDL principle implies that for the MDL-optimal models $M_1$ and $M_2$ and $t$ from $D_1$, the expected average value of $L(t \mid M_2) - L(t \mid M_1)$ is positive. The next step towards a dissimilarity measure is to aggregate these code length differences over the dataset.
If we were to do this naively, the resulting aggregate would depend on the size of the data. To avoid this, we normalize by dividing by the 'native' encoded size of the database.
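One way such a normalized aggregate could look is sketched below, continuing the Python example above; the precise measure is defined in [66], so the formulation here, like the function names, is an assumption. We divide the aggregated code length difference for $D_1$ by its native encoded size $L(D_1 \mid M_1)$, and symmetrize by taking the maximum over both directions.

# Assumed formulation, not taken verbatim from [66]; reuses
# encoded_length() from the sketch above.

def aggregated_difference(db, ct_own, ct_other):
    """Normalized aggregate: (L(D | M_other) - L(D | M_own)) / L(D | M_own)."""
    own = sum(encoded_length(t, ct_own) for t in db)
    other = sum(encoded_length(t, ct_other) for t in db)
    return (other - own) / own

def dissimilarity(db1, ct1, db2, ct2):
    """Symmetric dataset dissimilarity: close to zero when both databases
    stem from the same distribution, larger the more they differ."""
    return max(aggregated_difference(db1, ct1, ct2),
               aggregated_difference(db2, ct2, ct1))

Dividing by the native encoded size makes the measure independent of the number of tuples, so databases of different sizes can be compared on an equal footing.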