Outlier Detection - Data Mining: Concepts and Techniques


Detection of Univariate Outliers Based on Normal Distribution

Data involving only one attribute or variable are called univariate data. For simplicity,

we often choose to assume that data are generated from a normal distribution. We can

then learn the parameters of the normal distribution from the input data, and identify

the points with low probability as outliers.
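The idea in the paragraph above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the function names, the sample `readings`, and the `density_cutoff` value are all hypothetical choices made for the example, and the cutoff in particular would have to be chosen per application.

```python
import math

def fit_normal_mle(values):
    """Estimate the mean and standard deviation (dividing by n) from the data."""
    n = len(values)
    mu = sum(values) / n
    var = sum((x - mu) ** 2 for x in values) / n
    return mu, math.sqrt(var)

def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def low_density_points(values, density_cutoff):
    """Return the points whose fitted normal density falls below the cutoff."""
    mu, sigma = fit_normal_mle(values)
    return [x for x in values if normal_pdf(x, mu, sigma) < density_cutoff]

# Hypothetical readings: nine values near 10 and one far away.
readings = [10.0, 10.1, 9.9, 10.2, 9.8, 10.0, 10.1, 9.9, 10.0, 15.0]
print(low_density_points(readings, density_cutoff=0.01))
```

Because the outlier itself is included when the parameters are learned, it inflates the estimated standard deviation; the sketch makes no attempt to correct for that.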

Let's start with univariate data. We will try to detect outliers by assuming the data

follow a normal distribution.

Example 12.8 Univariate outlier detection using maximum likelihood. Suppose a city's average temperature values in July in the last 10 years are, in value-ascending order, 24.0°C, 28.9°C, 28.9°C, 29.0°C, 29.1°C, 29.1°C, 29.2°C, 29.2°C, 29.3°C, and 29.4°C. Let's assume that the average temperature follows a normal distribution, which is determined by two parameters: the mean, $\mu$, and the standard deviation, $\sigma$.

We can use the maximum likelihood method to estimate the parameters $\mu$ and $\sigma$. That is, we maximize the log-likelihood function

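$$\ln L(\mu, \sigma^2) = \sum_{i=1}^{n} \ln f(x_i \mid (\mu, \sigma^2)) = -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2 \tag{12.1}$$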

where n is the total number of samples, which is 10 in this example.
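To make the log-likelihood concrete, the following sketch evaluates it on the ten temperatures from the example, once as a sum of log densities and once in closed form. The function names and the three candidate parameter pairs are hypothetical, chosen only to show that parameter values closer to the data score higher.

```python
import math

def normal_log_pdf(x, mu, var):
    """Log density of a normal distribution with mean mu and variance var."""
    return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)

def log_likelihood(xs, mu, var):
    """Closed-form normal log-likelihood of the sample xs."""
    n = len(xs)
    squared_error = sum((x - mu) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(var)
            - squared_error / (2 * var))

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

# Hypothetical (mean, variance) guesses to compare.
candidates = [(27.0, 2.0), (28.6, 2.4), (29.0, 1.0)]
best = max(candidates, key=lambda p: log_likelihood(temps, *p))
print(best)
```

Maximum likelihood estimation replaces this comparison of a few guesses with an exact maximization over all parameter values.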

Taking derivatives with respect to $\mu$ and $\sigma^2$ and solving the resulting system of first-order conditions leads to the following maximum likelihood estimates:

$$\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \tag{12.2}$$

$$\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 \tag{12.3}$$
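For readers who want the intermediate step, the first-order conditions behind these estimates can be written out as follows. This is the standard derivation for a normal log-likelihood, sketched here for convenience rather than quoted from the text.

```latex
\begin{align*}
\frac{\partial \ln L}{\partial \mu}
  &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  &&\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \\
\frac{\partial \ln L}{\partial \sigma^2}
  &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0
  &&\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 .
\end{align*}
```

The variance estimate divides by $n$ rather than $n - 1$, so it is the maximum likelihood estimate and not the unbiased sample variance.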

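In this example, we have

$$\hat{\mu} = \frac{24.0 + 28.9 + 28.9 + 29.0 + 29.1 + 29.1 + 29.2 + 29.2 + 29.3 + 29.4}{10} = 28.61$$

$$\hat{\sigma}^2 = \frac{(24.1 - 28.61)^2 + 2(28.9 - 28.61)^2 + (29.0 - 28.61)^2 + 2(29.1 - 28.61)^2 + 2(29.2 - 28.61)^2 + (29.3 - 28.61)^2 + (29.4 - 28.61)^2}{10} \approx 2.29.$$

Accordingly, we have $\hat{\sigma} = \sqrt{2.29} = 1.51$.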
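The arithmetic can be rechecked from the ten listed temperatures. The sketch below uses Python's standard `statistics` module, whose `pvariance` divides by n and so matches the maximum likelihood variance estimate; the variable names are illustrative. Recomputed from the listed values, with 24.0 as the smallest reading, the variance comes out at about 2.38 and the standard deviation at about 1.54. The printed 2.29 and 1.51 correspond to using 24.1 in the first squared term.

```python
import math
import statistics

temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

mu_hat = statistics.fmean(temps)                # sample mean
var_hat = statistics.pvariance(temps, mu_hat)   # population variance: divides by n
sigma_hat = math.sqrt(var_hat)

print(round(mu_hat, 2), round(var_hat, 2), round(sigma_hat, 2))
```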

The most deviating value, 24.0°C, is 4.61°C away from the estimated mean. We know that the $\mu \pm 3\sigma$ region contains 99.7% of the data under the assumption of normal distribution.
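The 99.7% figure can be checked numerically. For a normal variable, the probability of falling within k standard deviations of the mean is erf(k / √2); the helper name `normal_coverage` below is a hypothetical choice for this sketch.

```python
import math

def normal_coverage(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(k, round(normal_coverage(k), 4))
```

For k = 3 this gives roughly 0.9973, so about 0.27% of the probability mass lies outside the three-sigma region, split evenly between the two tails.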
