Outlier Detection - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

12.9 Summary

Assume that a given statistical process is used to generate a set of data objects. An

outlier is a data object that deviates significantly from the rest of the objects, as if it

were generated by a different mechanism.

Types of outliers include global outliers, contextual outliers, and collective outliers.

An object may be more than one type of outlier.

Global outliers are the simplest form of outlier and the easiest to detect. A contextual

outlier deviates significantly with respect to a specific context of the object (e.g., a

Toronto temperature value of 28 C is an outlier if it occurs in the context of winter).

A subset of data objects forms a collective outlier if the objects as a whole deviate

significantly from the entire data set, even though the individual data objects may not

be outliers. Collective outlier detection requires background information to model

the relationships among objects to find outlier groups.

Challenges in outlier detection include finding appropriate data models, the depen-

dence of outlier detection systems on the application involved, finding ways to

distinguish outliers from noise, and providing justification for identifying outliers

as such.

Outlier detection methods can be categorized according to whether the sample

of data for analysis is given with expert-provided labels that can be used to build

an outlier detection model. In this case, the detection methods are supervised,

semi-supervised , or unsupervised . Alternatively, outlier detection methods may be

organized according to their assumptions regarding normal objects versus out-

liers. This categorization includes statistical methods, proximity-based methods, and

clustering-based methods.

Statistical outlier detection methods (or model-based methods ) assume that the

normal data objects follow a statistical model, where data not following the model

are considered outliers. Such methods may be parametric (they assume that the data

are generated by a parametric distribution) or nonparametric (they learn a model for

the data, rather than assuming one a priori). Parametric methods for multivariate

data may employ the Mahalanobis distance, the

2 -statistic, or a mixture of mul-

tiple parametric models. Histograms and kernel density estimation are examples of

nonparametric methods.

Proximity-based outlier detection methods assume that an object is an outlier

if the proximity of the object to its nearest neighbors significantly deviates from

the proximity of most of the other objects to their neighbors in the same data

set. Distance-basedoutlierdetectionmethods consult the neighborhood of an object,

defined by a given radius. An object is an outlier if its neighborhood does not have

enough other points. In density-basedoutlierdetectionmethods , an object is an outlier

if its density is relatively much lower than that of its neighbors.

Search WWH ::

Custom Search

Home