Databases Reference
In-Depth Information
12.9 Summary
Assume that a given statistical process is used to generate a set of data objects. An
outlier is a data object that deviates significantly from the rest of the objects, as if it
were generated by a different mechanism.
Types of outliers include global outliers, contextual outliers, and collective outliers.
An object may be more than one type of outlier.
Global outliers are the simplest form of outlier and the easiest to detect. A contextual
outlier deviates significantly with respect to a specific context of the object (e.g., a
Toronto temperature value of 28 C is an outlier if it occurs in the context of winter).
A subset of data objects forms a collective outlier if the objects as a whole deviate
significantly from the entire data set, even though the individual data objects may not
be outliers. Collective outlier detection requires background information to model
the relationships among objects to find outlier groups.
Challenges in outlier detection include finding appropriate data models, the depen-
dence of outlier detection systems on the application involved, finding ways to
distinguish outliers from noise, and providing justification for identifying outliers
as such.
Outlier detection methods can be categorized according to whether the sample
of data for analysis is given with expert-provided labels that can be used to build
an outlier detection model. In this case, the detection methods are supervised,
semi-supervised , or unsupervised . Alternatively, outlier detection methods may be
organized according to their assumptions regarding normal objects versus out-
liers. This categorization includes statistical methods, proximity-based methods, and
clustering-based methods.
Statistical outlier detection methods (or model-based methods ) assume that the
normal data objects follow a statistical model, where data not following the model
are considered outliers. Such methods may be parametric (they assume that the data
are generated by a parametric distribution) or nonparametric (they learn a model for
the data, rather than assuming one a priori). Parametric methods for multivariate
data may employ the Mahalanobis distance, the
2 -statistic, or a mixture of mul-
tiple parametric models. Histograms and kernel density estimation are examples of
nonparametric methods.
Proximity-based outlier detection methods assume that an object is an outlier
if the proximity of the object to its nearest neighbors significantly deviates from
the proximity of most of the other objects to their neighbors in the same data
set. Distance-basedoutlierdetectionmethods consult the neighborhood of an object,
defined by a given radius. An object is an outlier if its neighborhood does not have
enough other points. In density-basedoutlierdetectionmethods , an object is an outlier
if its density is relatively much lower than that of its neighbors.
 
Search WWH ::




Custom Search