Detection of Univariate Outliers Based on Normal Distribution
Data involving only one attribute or variable are called univariate data. For simplicity,
we often choose to assume that data are generated from a normal distribution. We can
then learn the parameters of the normal distribution from the input data, and identify
the points with low probability as outliers.
Let's start with univariate data. We will try to detect outliers by assuming the data
follow a normal distribution.
Example 12.8 Univariate outlier detection using maximum likelihood. Suppose a city's average
temperature values in July in the last 10 years are, in value-ascending order, 24.0 °C, 28.9 °C,
28.9 °C, 29.0 °C, 29.1 °C, 29.1 °C, 29.2 °C, 29.2 °C, 29.3 °C, and 29.4 °C. Let's assume that
the average temperature follows a normal distribution, which is determined by two
parameters: the mean, $\mu$, and the standard deviation, $\sigma$.
We can use the maximum likelihood method to estimate the parameters $\mu$ and $\sigma$. That
is, we maximize the log-likelihood function
$$
\ln L(\mu, \sigma^2) = \sum_{i=1}^{n} \ln f\big(x_i \mid (\mu, \sigma^2)\big)
= -\frac{n}{2}\ln(2\pi) - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2,
\tag{12.1}
$$
where n is the total number of samples, which is 10 in this example.
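The log-likelihood in Eq. (12.1) can be evaluated directly for any candidate pair of parameters. The following is a minimal Python sketch, assuming the ten temperatures listed in the example; the function name `log_likelihood` and the candidate parameter values are illustrative choices, not part of the original text.

```python
import math

def log_likelihood(xs, mu, sigma2):
    """Normal log-likelihood of Eq. (12.1) for mean mu and variance sigma2."""
    n = len(xs)
    squared_dev = sum((x - mu) ** 2 for x in xs)
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * n * math.log(sigma2)
            - squared_dev / (2 * sigma2))

# July average temperatures from the example, in degrees Celsius.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

# Candidate (mu, sigma2) pairs picked only for illustration.
for mu, sigma2 in [(28.0, 2.0), (28.61, 2.0), (29.0, 2.0)]:
    print(f"mu={mu:5.2f}  sigma2={sigma2:.1f}  lnL={log_likelihood(temps, mu, sigma2):.4f}")
```

With the variance held fixed, the candidate whose mean equals the sample average gives the largest log-likelihood of the three, which is what the maximum likelihood method exploits.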
Taking derivatives with respect to $\mu$ and $\sigma^2$ and solving the resulting system of
first-order conditions leads to the following maximum likelihood estimates:
$$
\hat{\mu} = \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\tag{12.2}
$$

$$
\hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2.
\tag{12.3}
$$
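For readers who want the intermediate step, the first-order conditions can be written out as follows. This is the standard derivation for the normal log-likelihood of Eq. (12.1), given here as a sketch rather than quoted from the original text.

```latex
\begin{align*}
\frac{\partial \ln L}{\partial \mu}
  &= \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu) = 0
  &&\Longrightarrow\quad \hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i = \bar{x}, \\
\frac{\partial \ln L}{\partial \sigma^2}
  &= -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2 = 0
  &&\Longrightarrow\quad \hat{\sigma}^2 = \frac{1}{n}\sum_{i=1}^{n}(x_i - \bar{x})^2 .
\end{align*}
```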
In this example, we have

$$
\hat{\mu} = \frac{24.0 + 28.9 + 28.9 + 29.0 + 29.1 + 29.1 + 29.2 + 29.2 + 29.3 + 29.4}{10} = 28.61
$$

$$
\begin{aligned}
\hat{\sigma}^2 = \big(&(24.1 - 28.61)^2 + (28.9 - 28.61)^2 + (28.9 - 28.61)^2 + (29.0 - 28.61)^2 \\
&+ (29.1 - 28.61)^2 + (29.1 - 28.61)^2 + (29.2 - 28.61)^2 + (29.2 - 28.61)^2 \\
&+ (29.3 - 28.61)^2 + (29.4 - 28.61)^2\big)/10 \approx 2.29.
\end{aligned}
$$
Accordingly, we have $\hat{\sigma} = \sqrt{2.29} = 1.51$.
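The estimates in Eqs. (12.2) and (12.3) can be checked numerically. Below is a minimal Python sketch, assuming the ten temperatures exactly as listed in the example; the function name `mle_normal` is an illustrative choice. Note that with 24.0 as the smallest value, Eq. (12.3) evaluates to about 2.38. The expansion printed in the example uses 24.1 in its first squared term, and that substitution is what gives the value 2.29; the sketch computes both so the two can be compared.

```python
import math

def mle_normal(xs):
    """Maximum likelihood estimates (mean, variance) for a normal model.

    The variance divides by n, as in Eq. (12.3), not by n - 1.
    """
    n = len(xs)
    mu_hat = sum(xs) / n
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

# July average temperatures from the example, in degrees Celsius.
temps = [24.0, 28.9, 28.9, 29.0, 29.1, 29.1, 29.2, 29.2, 29.3, 29.4]

mu_hat, sigma2_hat = mle_normal(temps)
sigma_hat = math.sqrt(sigma2_hat)

# Variance as expanded in the example: 24.1 in the first term, mean fixed at 28.61.
printed_sigma2 = ((24.1 - 28.61) ** 2
                  + sum((x - 28.61) ** 2 for x in temps[1:])) / len(temps)

print(f"mu_hat         = {mu_hat:.2f}")
print(f"sigma2_hat     = {sigma2_hat:.4f}")
print(f"sigma_hat      = {sigma_hat:.4f}")
print(f"printed sigma2 = {printed_sigma2:.4f}")
```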
The most deviating value, 24.0 °C, is 4.61 °C away from the estimated mean. We
know that the $\mu \pm 3\sigma$ region contains 99.7% of the data under the assumption of normal
 