description out of all models in the model class, and it is that model that we are interested in—in the end, the length of the description is often not of much interest. When
compression is the goal, a general purpose compressor such as ZIP often provides
much better compression, as it can exploit many types of statistical dependencies.
In a similar vein, it is also important to note that in MDL we are not concerned with
materialized codes, but only interested in their lengths—again, as model selection
is the goal. Although a complete overview of all useful codes—for which we can
compute the optimal lengths in practice—is beyond the scope of this chapter, we
will discuss a few instances in the next chapter, where we show how to use
MDL for pattern mining. Before we do so, however, let us quickly go into the general
applicability of MDL in data mining.
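To make the point about lengths versus materialized codes concrete, here is a minimal Python sketch (not from the chapter) that computes Shannon-optimal code lengths, L(x) = -log2 P(x), directly from symbol frequencies. No actual code words are ever constructed; the lengths alone suffice for comparing models. The symbol names and counts are hypothetical.

```python
import math

def code_lengths(counts):
    """Shannon-optimal code lengths (in bits) for the distribution
    induced by raw symbol counts; no code is materialized."""
    total = sum(counts.values())
    return {x: -math.log2(c / total) for x, c in counts.items()}

# Hypothetical symbol frequencies
counts = {"a": 8, "b": 4, "c": 2, "d": 2}
lengths = code_lengths(counts)
# "a" occurs half the time, so L(a) = -log2(8/16) = 1 bit
```

Summing `lengths[x]` over the symbols of a dataset gives the total encoded size under this model, which is all that model selection needs.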
2.3 MDL in Data Mining
Faloutsos and Megalooikonomou [15] argue that Kolmogorov Complexity and
Minimum Description Length [20, 52] provide a powerful and well-founded approach
to data mining. There exist many examples where MDL has been successfully
employed in data mining, including, for example, classification [37, 50], clustering
[6, 31, 39], discretization [16, 30], defining parameter-free distance measures
[11, 28, 29, 66], feature selection [48], imputation [65], mining temporally surprising
patterns [8], detecting change points in data streams [36], model order selection in
matrix factorization [45], outlier detection [3, 58], summarizing categorical data [43],
transfer learning [54], discovering communities in matrices [9, 47, 63] and evolving
graphs [60], finding sources of infection in large graphs [49], and making sense
of selected nodes in graphs [4].
We will discuss a few of these instances in Sect. 5, but first cover how to define
an MDL score for a pattern-based model.
3 Compression-based Pattern Models
In this section we introduce how to use the above foundations for mining small sets
of patterns that capture the data distribution well. We will give both the high-level
picture and illustrate with concrete instances and examples. Before we go into details,
let us briefly describe the basic ingredients that are required for any pattern-based
model.
We assume that a dataset D is a bag of elements t of some data type—which
we, for simplicity, will refer to as tuples. In the context of frequent itemset mining,
each t is a transaction over a set of items I, i.e., t ⊆ I. Similarly, we can consider
sequences, trees, graphs, time series, etc. Let us write T to denote the universe of
possible tuples for a given data type. Clearly, all tuples in D are elements from T.
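As a minimal Python illustration of this setup (the item names are hypothetical and not from the chapter), a transactional dataset D can be represented as a list of sets over an item universe I, where the list allows duplicate transactions since D is a bag:

```python
# Item universe I for an itemset-mining setting (hypothetical items)
I = {"bread", "milk", "beer", "diapers"}

# Dataset D: a bag of transactions, each a subset of I.
# A list (not a set) is used so duplicate tuples are allowed.
D = [
    {"bread", "milk"},
    {"beer", "diapers"},
    {"bread", "milk", "beer"},
    {"bread", "milk"},  # duplicate transaction: D is a bag
]

# Every tuple t in D is an element of the universe T = 2^I,
# i.e., each transaction is a subset of the item universe.
assert all(t <= I for t in D)
```

For sequences, trees, or graphs the universe T changes accordingly, but the same picture holds: D is a bag of tuples drawn from T.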