Outlier Detection - Data Mining: Concepts and Techniques

Databases Reference

In-Depth Information

Statistical Methods

Statistical methods (also known as model-based methods ) make assumptions of

data normality. They assume that normal data objects are generated by a statistical

(stochastic) model, and that data not following the model are outliers.

Example 12.5 Detecting outliers using a statistical (Gaussian) model. In Figure 12.1, the data points

except for those in region R fit a Gaussian distribution g D , where for a location x in the

data space, g D .

gives the probability density at x . Thus, the Gaussian distribution g D

can be used to model the normal data, that is, most of the data points in the data set. For

each object y in region, R , we can estimate g D .

x

/

y

/

, the probability that this point fits the

Gaussian distribution. Because g D .

y

/

is very low, y is unlikely generated by the Gaussian

model, and thus is an outlier.

The effectiveness of statistical methods highly depends on whether the assumptions

made for the statistical model hold true for the given data. There are many kinds of

statistical models. For example, the statistic models used in the methods may be para-

metric or nonparametric. Statistical methods for outlier detection are discussed in detail

in Section 12.3.

Proximity-Based Methods

Proximity-based methods assume that an object is an outlier if the nearest neighbors

of the object are far away in feature space, that is, the proximity of the object to its

neighbors significantly deviates from the proximity of most of the other objects to their

neighbors in the same data set.

Example 12.6 Detecting outliers using proximity. Consider the objects in Figure 12.1 again. If we

model the proximity of an object using its three nearest neighbors, then the objects

in region R are substantially different from other objects in the data set. For the two

objects in R , their second and third nearest neighbors are dramatically more remote

than those of any other objects. Therefore, we can label the objects in R as outliers based

on proximity.

The effectiveness of proximity-based methods relies heavily on the proximity (or dis-

tance) measure used. In some applications, such measures cannot be easily obtained.

Moreover, proximity-based methods often have difficulty in detecting a group of outliers

if the outliers are close to one another.

There are two major types of proximity-based outlier detection, namely distance-

based and density-based outlier detection. Proximity-based outlier detection is discussed

in Section 12.4.

Clustering-Based Methods

Clustering-based methods assume that the normal data objects belong to large and

dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to

any clusters.

Data Mining: Concepts and Techniques

Search WWH ::

Custom Search

Home