Databases Reference
In-Depth Information
12.9
Summary
Assume that a given statistical process is used to generate a set of data objects. An
outlier
is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
Types of outliers
include global outliers, contextual outliers, and collective outliers.
An object may be more than one type of outlier.
Global outliers
are the simplest form of outlier and the easiest to detect. A
contextual
outlier
deviates significantly with respect to a specific context of the object (e.g., a
Toronto temperature value of 28
C is an outlier if it occurs in the context of winter).
A subset of data objects forms a
collective outlier
if the objects as a whole deviate
significantly from the entire data set, even though the individual data objects may not
be outliers. Collective outlier detection requires background information to model
the relationships among objects to find outlier groups.
Challenges
in outlier detection include finding appropriate data models, the depen-
dence of outlier detection systems on the application involved, finding ways to
distinguish outliers from noise, and providing justification for identifying outliers
as such.
Outlier detection methods can be
categorized
according to whether the sample
of data for analysis is given with expert-provided labels that can be used to build
an outlier detection model. In this case, the detection methods are
supervised,
semi-supervised
, or
unsupervised
. Alternatively, outlier detection methods may be
organized according to their assumptions regarding normal objects versus out-
liers. This categorization includes
statistical
methods,
proximity-based
methods, and
clustering-based
methods.
Statistical outlier detection methods
(or
model-based methods
) assume that the
normal data objects follow a statistical model, where data not following the model
are considered outliers. Such methods may be
parametric
(they assume that the data
are generated by a parametric distribution) or
nonparametric
(they learn a model for
the data, rather than assuming one a priori). Parametric methods for multivariate
data may employ the Mahalanobis distance, the
2
-statistic, or a mixture of mul-
tiple parametric models. Histograms and kernel density estimation are examples of
nonparametric methods.
Proximity-based outlier detection methods
assume that an object is an outlier
if the proximity of the object to its nearest neighbors significantly deviates from
the proximity of most of the other objects to their neighbors in the same data
set.
Distance-basedoutlierdetectionmethods
consult the
neighborhood
of an object,
defined by a given radius. An object is an outlier if its neighborhood does not have
enough other points. In
density-basedoutlierdetectionmethods
, an object is an outlier
if its density is relatively much lower than that of its neighbors.