5.4 Other Data Mining Tasks
So far we covered some of the most prominent tasks in data mining. However, many
more tasks have been formulated in terms of MDL and pattern-based models. Below,
we briefly describe five examples.
5.4.1 Data Generation—and Privacy Preservation
The MDL principle is primarily geared towards descriptive models. However, these
models can also be employed as predictive models, such as in the classification
example above. Furthermore, under certain conditions, compression-based models
can also be used as generative models.
By exploiting the close relation between code lengths and probability distributions,
code tables can be used for data generation. For categorical data, synthetic
data generated from a KRIMP code table has the property that, in expectation, the
deviation between the observed and original frequencies is very small for all itemsets
[67]. One application is privacy preservation: the generated data has the same
characteristics as the original data, yet individual details are lost and specified
levels of anonymity can be obtained.
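The connection between code lengths and probabilities can be made concrete: a pattern with code length -log2(usage/total) is, under the coding distribution, emitted with probability usage/total. The sketch below samples patterns in proportion to their usage counts to build synthetic transactions. The code table, its usage counts, and the `generate_transaction` helper are all hypothetical and highly simplified; the actual KRIMP generator [67] is considerably more involved.

```python
import random

# Hypothetical code table: each pattern paired with its usage count.
# In KRIMP, a pattern's code length is -log2(usage / total_usage),
# so sampling in proportion to usage mirrors the coding distribution.
CODE_TABLE = [
    (frozenset({"a", "b"}), 40),
    (frozenset({"c"}), 30),
    (frozenset({"a"}), 20),
    (frozenset({"b", "c"}), 10),
]

def generate_transaction(code_table, rng, max_tries=20):
    """Sample patterns proportional to usage; union non-overlapping ones."""
    patterns, weights = zip(*code_table)
    transaction = set()
    for _ in range(max_tries):
        p = rng.choices(patterns, weights=weights, k=1)[0]
        if transaction & p:  # keep the cover disjoint, as KRIMP covers are
            break
        transaction |= p
    return transaction

rng = random.Random(0)
synthetic = [generate_transaction(CODE_TABLE, rng) for _ in range(1000)]
```

Because patterns are drawn according to the coding distribution, frequently used patterns reappear in the synthetic data at roughly their original rates, while no original transaction is copied verbatim.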
5.4.2 Missing Value Estimation
Many datasets have missing values. Under the assumption that these values are
missing without correlation to the data, they do not affect the observed overall
distribution. Consequently, despite those missing values, a model of reasonable
quality can be induced given sufficient data. Given such a database and corresponding
model, the best estimate for a single missing value is the one that minimizes the total
compressed size. We can do so both for individual tuples, as well as for databases
with many missing values: by iteratively imputing the values and inducing the model,
completed datasets with very high accuracy are obtained [65].
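A minimal sketch of this estimation strategy for a single missing value, assuming a toy code table with hypothetical patterns and usage counts: each candidate value completes the tuple, and the candidate whose completion encodes in the fewest bits is chosen. The greedy longest-first cover and the `attribute=value` item encoding are simplifications of KRIMP's actual scheme, in which singleton codes guarantee every tuple a finite encoding.

```python
from math import log2

# Hypothetical code table over attribute=value items (illustrative only).
CODE_TABLE = [
    (frozenset({"x=1", "y=a"}), 50),
    (frozenset({"x=2"}), 25),
    (frozenset({"y=a"}), 15),
    (frozenset({"y=b"}), 10),
]
TOTAL_USAGE = sum(u for _, u in CODE_TABLE)

def encoded_length(transaction):
    """Greedy cover, longest patterns first; sum the code lengths in bits."""
    remaining, bits = set(transaction), 0.0
    for pattern, usage in sorted(CODE_TABLE, key=lambda e: -len(e[0])):
        if pattern <= remaining:
            bits += -log2(usage / TOTAL_USAGE)
            remaining -= pattern
    if remaining:  # uncovered items: no finite code in this toy table
        return float("inf")
    return bits

def impute(partial, attribute, candidates):
    """Pick the candidate value whose completed tuple compresses best."""
    return min(candidates,
               key=lambda v: encoded_length(partial | {f"{attribute}={v}"}))

# A tuple with x=1 observed and y missing:
best = impute({"x=1"}, "y", ["a", "b"])
```

For databases with many missing values, the same idea is applied iteratively, as described above: impute all values, re-induce the model on the completed data, and repeat until the imputations stabilize.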
5.4.3 Change Detection in Data Streams
A database can be a mixture of different distributions, but in data streams concept drift
is common: one distribution is 'replaced' by another distribution. In this context, it is
important to detect when such a change occurs. Complicating issues are that streams
are usually unbounded, can have high velocity, and allow only limited computation
time for processing.
By first assuming that the data stream is sampled from a single distribution, a
model can be induced on only a few samples; how many are needed can be deduced
from the attained compression ratios. Once we have a model, we can observe the
compressed size of the new data; if this is considerably larger than for the earlier