samples, a change has occurred and a new model should be induced. In particular,
for sudden distribution shifts, this scheme is highly effective [36].
5.4.4 Coherent Group Discovery
Whereas the Identifying Database Components problem assumes that we are interested
in a partitioning of the complete database, this task aims to discover coherent
subsets of the data that deviate from the overall distribution. As such, it is
an instance of subspace clustering. In terms of MDL, this means that the goal is to
find groups that can be compressed much better by themselves than as part of the
complete database.
As an example application, this approach was applied to tag data obtained for
different media types [38]. It was shown that, using only tag information, coherent
groups of media, e.g., photos, can be discovered.
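The criterion of "compresses better by itself than as part of the complete database" can be made concrete with a small sketch. This is not the method of [38]: as a simplifying assumption it replaces pattern-based code tables with a Laplace-smoothed Shannon code over single items, and it ignores the cost of encoding the model itself. The function names (item_code_lengths, encoded_size, compression_gain) are hypothetical and used for illustration only.

```python
import math
from collections import Counter


def item_code_lengths(transactions, alphabet):
    """Shannon code length in bits per item, from Laplace-smoothed item frequencies.

    Assumption: a single-item code stands in for a pattern-based code table.
    """
    counts = Counter(item for t in transactions for item in t)
    total = sum(counts.values()) + len(alphabet)
    return {item: -math.log2((counts[item] + 1) / total) for item in alphabet}


def encoded_size(transactions, code_lengths):
    """Total number of bits needed to encode all transactions with the given code."""
    return sum(code_lengths[item] for t in transactions for item in t)


def compression_gain(group, database):
    """Bits saved by encoding `group` with its own model instead of the global one.

    A clearly positive gain suggests the group deviates from the overall
    distribution. Model cost is ignored here, which a full MDL score would include.
    """
    alphabet = {item for t in database for item in t}
    global_model = item_code_lengths(database, alphabet)
    local_model = item_code_lengths(group, alphabet)
    return encoded_size(group, global_model) - encoded_size(group, local_model)


# Toy tag data: a small group of beach photos inside a larger music collection.
photos = [("beach", "sun")] * 5
music = [("rock", "guitar")] * 20
database = photos + music
gain = compression_gain(photos, database)
```

In this toy setting the photo group has a positive gain, while scoring the complete database against itself gives a gain of zero, since both models coincide.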
5.4.5 Outlier Detection
All databases contain outliers, but defining exactly what an outlier is and detecting
outliers are well known to be challenging tasks. If we assume that the number of
outliers is small (a safe assumption, given the intuition of what an outlier is),
then the largest part of a dataset is 'normal'. Hence, a model induced on the
database should primarily capture what is normal, and not so much what is an outlier.
Outlier detection can then be formalized as a one-class classification problem: all
tuples that are compressed well belong to the 'normal' distribution, while tuples that
get a long encoding may be considered outliers. For transactional data, this approach
performs on par with the state of the art in the field [58].
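As a rough illustration of flagging tuples that get a long encoding, the sketch below induces a model on the complete database and marks tuples whose encoded length is unusually large. Two assumptions are ours rather than those of [58]: the model is a Laplace-smoothed single-item code instead of a pattern-based code table, and the decision threshold (mean plus k standard deviations of the encoded lengths) is only one of several reasonable choices. All function names are hypothetical.

```python
import math
from collections import Counter
from statistics import mean, pstdev


def fit_code_lengths(transactions):
    """Induce a simple item-level code on the whole database.

    Returns per-item code lengths in bits and the length assigned to unseen items.
    """
    counts = Counter(item for t in transactions for item in t)
    # One pseudo-count per known item, plus one unit of mass reserved for unseen items.
    total = sum(counts.values()) + len(counts) + 1
    lengths = {item: -math.log2((c + 1) / total) for item, c in counts.items()}
    unseen = -math.log2(1 / total)
    return lengths, unseen


def encoded_length(transaction, lengths, unseen):
    """Number of bits needed to encode a single transaction."""
    return sum(lengths.get(item, unseen) for item in transaction)


def find_outliers(transactions, k=2.0):
    """Return transactions whose encoding is longer than mean + k * std (assumed rule)."""
    lengths, unseen = fit_code_lengths(transactions)
    sizes = [encoded_length(t, lengths, unseen) for t in transactions]
    threshold = mean(sizes) + k * pstdev(sizes)
    return [t for t, s in zip(transactions, sizes) if s > threshold]


# Fifty 'normal' tuples and one tuple made of rare items.
data = [("a", "b")] * 50 + [("x", "y", "z")]
outliers = find_outliers(data)
```

Because the model is induced on mostly normal data, frequent items receive short codes and the rare tuple stands out by its encoded length alone; no labelled outliers are needed.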
5.5 The Advantage of Pattern-based Models
For each of these tasks, it is worth pointing out the added benefit of using
a pattern-based model. Besides yielding competitive, state-of-the-art performance,
the patterns help to characterize decisions. For example, in the case of outlier
detection, we can explain why a tuple is identified as an anomaly by pointing out
the patterns of the norm it does not comply with, as well as how strongly it is an
anomaly, that is, how much effort is needed to make it 'normal'. Similar
advantages hold for the classification task. For the clustering-related tasks, we have
the added benefit that we can offer specialized code tables, i.e., specialized
descriptions per subpart of the data: we are told not only which parts of the data
should go together, but also why, i.e., what patterns make these data points similar.
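To give a feel for how patterns can characterize an outlier decision, the following sketch reports which patterns of the norm a tuple only partially matches and counts how many items would have to be added to satisfy them. This is one possible reading of "effort", chosen for illustration; the text does not prescribe a specific measure, and the function name explain_outlier and the list norm_patterns are hypothetical.

```python
def explain_outlier(transaction, norm_patterns):
    """List norm patterns the transaction partially matches but does not fully contain.

    Returns (violations, effort), where each violation is a pair
    (pattern, missing_items) and effort is the number of distinct items that
    would have to be added to comply with all violated patterns.
    Assumption: only missing items are counted; removing surplus items is not modelled.
    """
    items = set(transaction)
    violations = []
    for pattern in norm_patterns:
        pattern_items = set(pattern)
        missing = pattern_items - items
        # A pattern is 'violated' if it overlaps the tuple but is not fully present.
        if missing and pattern_items & items:
            violations.append((sorted(pattern_items), sorted(missing)))
    effort = len({item for _, missing in violations for item in missing})
    return violations, effort


norm = [("bread", "butter"), ("milk", "cereal")]
violations, effort = explain_outlier(("bread", "caviar"), norm)
```

Here the tuple violates only the first pattern: it contains bread but lacks butter, so the reported effort is a single item.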