Figure 12.4 A complex data set. [Figure labels: clusters C1, C2, and C3, and an object o.]
Example 12.11 Multivariate outlier detection using multiple parametric distributions. Consider the
data set in Figure 12.4. There are two big clusters, C1 and C2. To assume that the data
are generated by a normal distribution would not work well here. The estimated mean
is located between the two clusters and not inside any cluster. The objects between the
two clusters cannot be detected as outliers since they are close to the mean.
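The failure of a single normal distribution can be illustrated with a minimal sketch. It assumes a hypothetical one-dimensional stand-in for Figure 12.4 (two clusters centered at 0 and 10, values chosen for illustration and not read from the figure) and uses a z-score as the outlier measure, which is one common choice rather than something this example prescribes.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical 1-D stand-in for the two big clusters C1 and C2.
c1 = [random.gauss(0.0, 1.0) for _ in range(200)]
c2 = [random.gauss(10.0, 1.0) for _ in range(200)]
data = c1 + c2

# Fit a single normal distribution to all of the data.
mu = statistics.mean(data)
sigma = statistics.pstdev(data)

def z_score(x):
    """Distance from the estimated mean, in standard deviations."""
    return abs(x - mu) / sigma

between = 5.0  # an object in the gap between the two clusters
inside = 0.0   # an object at the center of C1

print("estimated mean:", mu)
print("z-score of the object between the clusters:", z_score(between))
print("z-score of the object inside C1:", z_score(inside))
```

Under these assumptions the estimated mean lands in the gap between the clusters, so the object placed in the gap receives a smaller z-score than the object at the center of C1 and would not be flagged.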
To overcome this problem, we can instead assume that the normal data objects are
generated by multiple normal distributions, two in this case. That is, we assume two
normal distributions, $\Theta_1(\mu_1, \sigma_1)$ and $\Theta_2(\mu_2, \sigma_2)$. For any object, o, in the data set, the
probability that o is generated by the mixture of the two distributions is given by

$$\Pr(o \mid \Theta_1, \Theta_2) = f_{\Theta_1}(o) + f_{\Theta_2}(o),$$

where $f_{\Theta_1}$ and $f_{\Theta_2}$ are the probability density functions of $\Theta_1$ and $\Theta_2$, respectively. We
can use the expectation-maximization (EM) algorithm (Chapter 11) to learn the parameters
$\mu_1$, $\sigma_1$, $\mu_2$, $\sigma_2$ from the data, as we do in mixture models for clustering. Each cluster
is represented by a learned normal distribution. An object, o, is detected as an outlier if
it does not belong to any cluster, that is, the probability is very low that it was generated
by the combination of the two distributions.
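The two-distribution approach can be sketched as follows. This is a simplified illustration, not the book's implementation: it works in one dimension, assumes the two components have equal mixing weights (so only the two means and two standard deviations are learned), and uses a hypothetical threshold of 1e-3 on the summed densities. The data, the initialization, and the names `fit_two_normals` and `mixture_score` are all assumptions made for the example.

```python
import math
import random

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fit_two_normals(data, iterations=50):
    """EM for two equally weighted 1-D normals (an assumed simplification)."""
    # Crude initialization (an assumption): start the means at the extremes.
    mu1, mu2 = min(data), max(data)
    spread = (mu2 - mu1) / 4.0
    sigma1 = sigma2 = spread if spread > 0.0 else 1.0
    for _ in range(iterations):
        # E-step: responsibility of component 1 for each object.
        resp = []
        for x in data:
            p1 = normal_pdf(x, mu1, sigma1)
            p2 = normal_pdf(x, mu2, sigma2)
            total = p1 + p2
            resp.append(p1 / total if total > 0.0 else 0.5)
        # M-step: responsibility-weighted means and standard deviations.
        n1 = sum(resp)
        n2 = len(data) - n1
        mu1 = sum(r * x for r, x in zip(resp, data)) / n1
        mu2 = sum((1.0 - r) * x for r, x in zip(resp, data)) / n2
        var1 = sum(r * (x - mu1) ** 2 for r, x in zip(resp, data)) / n1
        var2 = sum((1.0 - r) * (x - mu2) ** 2 for r, x in zip(resp, data)) / n2
        # Floor the standard deviations to avoid a degenerate component.
        sigma1 = max(math.sqrt(var1), 1e-6)
        sigma2 = max(math.sqrt(var2), 1e-6)
    return mu1, sigma1, mu2, sigma2

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical data: two big clusters plus one object in the gap between them.
data = [random.gauss(0.0, 1.0) for _ in range(200)]
data += [random.gauss(10.0, 1.0) for _ in range(200)]
data.append(5.0)

mu1, sigma1, mu2, sigma2 = fit_two_normals(data)

def mixture_score(o):
    """Sum of the two learned densities at o, following the formula in the text."""
    return normal_pdf(o, mu1, sigma1) + normal_pdf(o, mu2, sigma2)

THRESHOLD = 1e-3  # hypothetical cutoff; in practice it depends on the data
outliers = [o for o in data if mixture_score(o) < THRESHOLD]
print("learned parameters:", mu1, sigma1, mu2, sigma2)
print("objects flagged as outliers:", outliers)
```

With these assumed values, the object at 5.0 lies far from both learned means, so its summed density falls well below the cutoff, while an object at the center of either cluster scores far above it.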
Example 12.12 Multivariate outlier detection using multiple clusters. Most of the data objects shown
in Figure 12.4 are in either C1 or C2. Other objects, representing noise, are uniformly
distributed in the data space. A small cluster, C3, is highly suspicious because it is not
close to either of the two major clusters, C1 and C2. The objects in C3 should therefore
be detected as outliers.
Note that identifying the objects in C3 as outliers is difficult, whether we assume that
the given data follow a normal distribution or a mixture of multiple distributions.
This is because the probability of the objects in C3 will be higher than that of some
of the noise objects, like o in Figure 12.4, due to the higher local density in C3.
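The difficulty can be made concrete with a small sketch. It assumes a hypothetical one-dimensional layout (C1 near 0, C2 near 10, a tight C3 near 20, and a lone noise object o at 30) and further assumes that a fitted mixture ends up with one normal component per group. The components are estimated directly from the known group members rather than by EM, purely to keep the example short.

```python
import math
import random
import statistics

def normal_pdf(x, mu, sigma):
    """Density of a univariate normal distribution at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

random.seed(2)  # fixed seed so the sketch is reproducible

# Hypothetical layout; none of these values are read from Figure 12.4.
c1 = [random.gauss(0.0, 1.0) for _ in range(200)]
c2 = [random.gauss(10.0, 1.0) for _ in range(200)]
c3 = [random.gauss(20.0, 0.2) for _ in range(10)]  # small but dense cluster
o = 30.0                                           # isolated noise object

# Assumption: one normal component per group, estimated from its own members.
components = [(statistics.mean(g), statistics.pstdev(g)) for g in (c1, c2, c3)]

def mixture_score(x):
    """Sum of the component densities at x."""
    return sum(normal_pdf(x, mu, sigma) for mu, sigma in components)

lowest_c3_score = min(mixture_score(x) for x in c3)
noise_score = mixture_score(o)
print("lowest score among C3 objects:", lowest_c3_score)
print("score of the noise object o:", noise_score)
```

In this setup every object in C3 scores higher than o, so any density threshold low enough to keep o as the only outlier lets all of C3 pass as normal.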
 