5.4 Other Data Mining Tasks
So far we covered some of the most prominent tasks in data mining. However, many
more tasks have been formulated in terms of MDL and pattern-based models. Below,
we briefly describe five examples.
5.4.1 Data Generation—and Privacy Preservation
The MDL principle is primarily geared towards descriptive models. However, these
models can also be employed as predictive models, such as in the classification
example above. Furthermore, under certain conditions, compression-based models
can also be used as generative models.
By exploiting the close relation between code lengths and probability distributions,
code tables can be used for data generation. For categorical data, synthetic
data generated from a KRIMP code table has the property that, in expectation, the
deviation between the observed and original frequencies is very small for all itemsets
[67]. One application is privacy preservation: the generated data has the same
characteristics as the original data, yet individual details are lost and specified
levels of anonymity can be obtained.
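The connection between code lengths and probabilities can be made concrete: a pattern with code length -log2(usage/total) is, under the coding distribution, emitted with probability usage/total. The sketch below samples patterns in proportion to their usage counts to build synthetic transactions. The code table, its usage counts, and the `generate_transaction` helper are all hypothetical and highly simplified; the actual KRIMP generator [67] is considerably more involved.

```python
import random

# Hypothetical code table: each pattern paired with its usage count.
# In KRIMP, a pattern's code length is -log2(usage / total_usage),
# so sampling in proportion to usage mirrors the coding distribution.
CODE_TABLE = [
    (frozenset({"a", "b"}), 40),
    (frozenset({"c"}), 30),
    (frozenset({"a"}), 20),
    (frozenset({"b", "c"}), 10),
]

def generate_transaction(code_table, rng, max_tries=20):
    """Sample patterns proportional to usage; union non-overlapping ones."""
    patterns, weights = zip(*code_table)
    transaction = set()
    for _ in range(max_tries):
        p = rng.choices(patterns, weights=weights, k=1)[0]
        if transaction & p:  # keep the cover disjoint, as KRIMP covers are
            break
        transaction |= p
    return transaction

rng = random.Random(0)
synthetic = [generate_transaction(CODE_TABLE, rng) for _ in range(1000)]
```

Because patterns are drawn according to the coding distribution, frequently used patterns reappear in the synthetic data at roughly their original rates, while no original transaction is copied verbatim.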
5.4.2 Missing Value Estimation
Many datasets have missing values. Under the assumption that these values are
missing without correlation to the data, they do not affect the observed overall
distribution. Consequently, despite those missing values, a model of reasonable
quality can be induced given sufficient data. Given such a database and corresponding
model, the best estimate for a single missing value is the one that minimizes the total
compressed size. We can do so both for individual tuples, as well as for databases
with many missing values: by iteratively imputing the values and inducing the model,
completed datasets with very high accuracy are obtained [65].
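A minimal sketch of this estimation strategy for a single missing value, assuming a toy code table with hypothetical patterns and usage counts: each candidate value completes the tuple, and the candidate whose completion encodes in the fewest bits is chosen. The greedy longest-first cover and the `attribute=value` item encoding are simplifications of KRIMP's actual scheme, in which singleton codes guarantee every tuple a finite encoding.

```python
from math import log2

# Hypothetical code table over attribute=value items (illustrative only).
CODE_TABLE = [
    (frozenset({"x=1", "y=a"}), 50),
    (frozenset({"x=2"}), 25),
    (frozenset({"y=a"}), 15),
    (frozenset({"y=b"}), 10),
]
TOTAL_USAGE = sum(u for _, u in CODE_TABLE)

def encoded_length(transaction):
    """Greedy cover, longest patterns first; sum the code lengths in bits."""
    remaining, bits = set(transaction), 0.0
    for pattern, usage in sorted(CODE_TABLE, key=lambda e: -len(e[0])):
        if pattern <= remaining:
            bits += -log2(usage / TOTAL_USAGE)
            remaining -= pattern
    if remaining:  # uncovered items: no finite code in this toy table
        return float("inf")
    return bits

def impute(partial, attribute, candidates):
    """Pick the candidate value whose completed tuple compresses best."""
    return min(candidates,
               key=lambda v: encoded_length(partial | {f"{attribute}={v}"}))

# A tuple with x=1 observed and y missing:
best = impute({"x=1"}, "y", ["a", "b"])
```

For databases with many missing values, the same idea is applied iteratively, as described above: impute all values, re-induce the model on the completed data, and repeat until the imputations stabilize.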
5.4.3 Change Detection in Data Streams
A database can be a mixture of different distributions, but in data streams concept drift
is common: one distribution is 'replaced' by another distribution. In this context, it is
important to detect when such a change occurs. Complicating issues are that streams
are usually unbounded, can have high velocity, and allow only limited computation
time for processing.
By first assuming that the data stream is sampled from a single distribution, a
model can be induced on only a few samples; how many are needed can be deduced
from the attained compression ratios. Once we have a model, we can observe the
compressed size of the new data; if this is considerably larger than for the earlier