Mining and Using Sets of Patterns through Compression - Frequent Pattern Mining


samples, a change has occurred and a new model should be induced. In particular for sudden distribution shifts, this scheme is highly effective [36].

5.4.4 Coherent Group Discovery

Whereas the Identifying Database Components problem assumes that we are interested in a partitioning of the complete database, this task aims at the discovery of coherent subsets of the data that deviate from the overall distribution. As such, it is an instance of subspace clustering. In terms of MDL, this means that the goal is to find groups that can be compressed much better by themselves than as part of the complete database.
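
The compression criterion above can be illustrated with a minimal sketch. It is not the method of the cited work: it assumes a deliberately crude model in which every item gets its own Shannon code, and it ignores the cost of encoding the model itself, which a full MDL score would include. The function names and the toy tag data are hypothetical.

```python
import math
from collections import Counter


def item_code_lengths(transactions):
    """Shannon code length in bits per item, based on item frequencies."""
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values())
    return {item: -math.log2(n / total) for item, n in counts.items()}


def encoded_bits(transactions, lengths):
    """Total bits needed to encode the transactions with the given codes."""
    return sum(lengths[item] for t in transactions for item in t)


def compression_gain(group, database):
    """Bits saved by encoding `group` with its own model instead of the
    model of the complete database (model cost is ignored: an assumption)."""
    global_bits = encoded_bits(group, item_code_lengths(database))
    local_bits = encoded_bits(group, item_code_lengths(group))
    return global_bits - local_bits


# Toy tag data: three photo-like and three code-like transactions.
database = [{"beach", "sea"}] * 3 + [{"python", "code"}] * 3
coherent = database[:3]             # only beach/sea transactions
mixed = [database[0], database[3]]  # one transaction of each kind

gain_coherent = compression_gain(coherent, database)
gain_mixed = compression_gain(mixed, database)
```

In this toy setting the coherent group saves bits when encoded on its own, whereas the mixed group, which mirrors the overall distribution, saves nothing. Because the model cost is left out, the gain here can never be negative; a complete score would have to charge for each extra model.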

For example, this approach was applied to tag data obtained for different media types [38]. It was shown that, using only tag information, coherent groups of media, e.g., photos, can be discovered.

5.4.5 Outlier Detection

All databases contain outliers, but defining exactly what an outlier is and detecting outliers are well known to be challenging tasks. If we assume that the number of outliers is small, which seems safe given the intuition of what an outlier is, then the largest part of a dataset is 'normal'. Hence, a model induced on the database should capture primarily what is normal, and not so much what is an outlier. Outlier detection can then be formalized as a one-class classification problem: all tuples that are compressed well belong to the 'normal' distribution, while tuples that get a long encoding may be considered outliers. For transactional data, this approach performs on par with the state of the art in the field [58].
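
As a rough illustration of scoring tuples by encoded length, the sketch below covers each transaction greedily with a small set of patterns and assigns each pattern a Shannon code based on how often it is used. This is a simplification under stated assumptions, not the algorithm evaluated in [58]: the pattern set is given by hand rather than selected by MDL, the add-one smoothing and the 5-bit threshold are arbitrary choices, and all function names are hypothetical.

```python
import math
from collections import Counter


def cover(transaction, patterns):
    """Greedily cover a transaction with patterns, in the given order."""
    remaining = set(transaction)
    used = []
    for pattern in patterns:
        if pattern <= remaining:
            used.append(pattern)
            remaining -= pattern
    return used


def build_code_lengths(database, patterns):
    """Code length in bits per pattern, from its smoothed usage in covers."""
    usage = Counter(p for t in database for p in cover(t, patterns))
    total = sum(usage.values()) + len(patterns)  # add-one smoothing (assumption)
    return {p: -math.log2((usage[p] + 1) / total) for p in patterns}


def encoded_length(transaction, patterns, code_lengths):
    """Bits needed to encode one transaction: sum of its cover's code lengths."""
    return sum(code_lengths[p] for p in cover(transaction, patterns))


database = (
    [frozenset({"bread", "butter"})] * 4
    + [frozenset({"bread", "butter", "milk"})] * 3
    + [frozenset({"caviar", "wrench"})]
)
# Hypothetical pattern set: one itemset first, then all singletons,
# so that every transaction can be covered completely.
items = sorted(set().union(*database))
patterns = [frozenset({"bread", "butter"})] + [frozenset({i}) for i in items]

code_lengths = build_code_lengths(database, patterns)
scores = {t: encoded_length(t, patterns, code_lengths) for t in set(database)}

threshold_bits = 5.0  # arbitrary cut-off for this toy data (assumption)
outliers = [t for t, bits in scores.items() if bits > threshold_bits]
```

On this toy data, the frequently used pattern receives a short code, so the common transactions are cheap to encode, while the single caviar/wrench transaction needs two rarely used codes and ends up above the threshold. In practice the cut-off would have to be chosen from the data, for instance from the distribution of encoded lengths.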

5.5 The Advantage of Pattern-based Models

For each of these tasks, we want to point out the added benefit of using a pattern-based model. Besides yielding competitive, state-of-the-art performance, the patterns help to characterize decisions. For example, in the case of outlier detection, we can explain why a tuple is flagged as an anomaly by pointing out the patterns of the norm it does not comply with, and we can quantify how strongly it is an anomaly by how much effort it takes to make it 'normal'. Similar advantages hold for the classification task. For the clustering-related tasks, we have the added benefit that we can offer specialized code tables, that is, specialized descriptions per subpart of the data: we are told not only which parts of the data should go together, but also why, namely which patterns make these data points similar.
