Statistical Methods
Statistical methods (also known as model-based methods) make assumptions of
data normality. They assume that normal data objects are generated by a statistical
(stochastic) model, and that data not following the model are outliers.
Example 12.5 Detecting outliers using a statistical (Gaussian) model. In Figure 12.1, the data points
except for those in region R fit a Gaussian distribution $g_D$, where for a location x in the
data space, $g_D(x)$ gives the probability density at x. Thus, the Gaussian distribution $g_D$
can be used to model the normal data, that is, most of the data points in the data set. For
each object y in region R, we can estimate $g_D(y)$, the probability that this point fits the
Gaussian distribution. Because $g_D(y)$ is very low, y is unlikely to have been generated by the
Gaussian model, and thus is an outlier.
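The idea in Example 12.5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the book's procedure: the data are one-dimensional and made up, the Gaussian is fitted to all points (including those in region R), and the cutoff `rel_threshold` (a fraction of the peak density) is a hypothetical parameter chosen for the sketch.

```python
from statistics import NormalDist


def gaussian_outliers(values, rel_threshold=0.05):
    """Return the values whose density under a fitted Gaussian is low.

    rel_threshold is an illustrative assumption: a value is flagged when
    its density falls below this fraction of the density at the mean.
    """
    # Estimate the mean and standard deviation from the data (this is g_D).
    model = NormalDist.from_samples(values)
    peak = model.pdf(model.mean)
    # g_D(y) far below the peak means y is unlikely under the model.
    return [v for v in values if model.pdf(v) < rel_threshold * peak]


# Made-up data: 21 "normal" points around 10, plus two points playing
# the role of region R.
normal_points = [10 + 0.1 * k for k in range(-10, 11)]
region_r = [15.0, 15.5]

outliers = gaussian_outliers(normal_points + region_r)
print(outliers)
```

Because the model is fitted to data that already contain the outliers, the estimated variance is inflated; with a larger share of outliers the same cutoff could miss them, which is one way the model assumptions can fail to hold.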
The effectiveness of statistical methods depends heavily on whether the assumptions
made for the statistical model hold true for the given data. There are many kinds of
statistical models. For example, the statistical models used in the methods may be
parametric or nonparametric. Statistical methods for outlier detection are discussed in detail
in Section 12.3.
Proximity-Based Methods
Proximity-based methods assume that an object is an outlier if the nearest neighbors
of the object are far away in feature space, that is, the proximity of the object to its
neighbors significantly deviates from the proximity of most of the other objects to their
neighbors in the same data set.
Example 12.6 Detecting outliers using proximity. Consider the objects in Figure 12.1 again. If we
model the proximity of an object using its three nearest neighbors, then the objects
in region R are substantially different from other objects in the data set. For the two
objects in R, their second and third nearest neighbors are dramatically more remote
than those of any other objects. Therefore, we can label the objects in R as outliers based
on proximity.
The effectiveness of proximity-based methods relies heavily on the proximity (or
distance) measure used. In some applications, such measures cannot be easily obtained.
Moreover, proximity-based methods often have difficulty in detecting a group of outliers
if the outliers are close to one another.
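A minimal sketch of the proximity idea from Example 12.6 follows. It scores each object by the distance to its k-th nearest neighbor and flags objects whose score is much larger than the typical score. The data set, the Euclidean distance, and the `factor` cutoff (a multiple of the median score) are all assumptions made for illustration.

```python
from math import dist
from statistics import median


def kth_neighbor_distance(points, k=3):
    """Distance from each point to its k-th nearest neighbor (Euclidean)."""
    scores = []
    for i, p in enumerate(points):
        distances = sorted(dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(distances[k - 1])
    return scores


def proximity_outliers(points, k=3, factor=3.0):
    """Flag points whose k-th neighbor distance exceeds factor * median.

    The factor-times-median cutoff is an illustrative assumption.
    """
    scores = kth_neighbor_distance(points, k)
    cutoff = factor * median(scores)
    return [p for p, s in zip(points, scores) if s > cutoff]


# Made-up data: a 4 x 4 grid of "normal" objects and two objects that
# sit close to each other but far from the grid (the role of region R).
grid = [(i, j) for i in range(4) for j in range(4)]
region_r = [(10.0, 10.0), (10.5, 10.0)]
points = grid + region_r

print(proximity_outliers(points, k=3))  # both objects in R are flagged
print(proximity_outliers(points, k=1))  # nothing is flagged
```

With k = 1 the two objects in R are each other's nearest neighbor, so their scores look unremarkable and neither is flagged. This is a small instance of the difficulty with groups of outliers that lie close to one another.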
There are two major types of proximity-based outlier detection, namely distance-
based and density-based outlier detection. Proximity-based outlier detection is discussed
in Section 12.4.
Clustering-Based Methods
Clustering-based methods assume that the normal data objects belong to large and
dense clusters, whereas outliers belong to small or sparse clusters, or do not belong to
any clusters.
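As a rough illustration of this assumption, the sketch below groups objects into clusters and flags those that end up in small clusters. The clustering step is a simple stand-in chosen for brevity (objects closer than `eps` are linked, and each connected group forms a cluster); it is not a method named in the text. The data set and the parameters `eps` and `min_size` are hypothetical.

```python
from math import dist


def threshold_clusters(points, eps):
    """Group point indices into clusters by linking pairs closer than eps."""
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        cluster, frontier = [seed], [seed]
        while frontier:
            i = frontier.pop()
            # Collect unvisited points within eps of the current point.
            near = [j for j in unvisited if dist(points[i], points[j]) < eps]
            for j in near:
                unvisited.remove(j)
            cluster.extend(near)
            frontier.extend(near)
        clusters.append(sorted(cluster))
    return clusters


def cluster_outliers(points, eps=1.5, min_size=3):
    """Flag points that belong to clusters with fewer than min_size members."""
    flagged = []
    for cluster in threshold_clusters(points, eps):
        if len(cluster) < min_size:
            flagged.extend(cluster)
    return [points[i] for i in sorted(flagged)]


# Made-up data: one large, dense cluster and a tiny two-object cluster.
grid = [(i, j) for i in range(4) for j in range(4)]
small_group = [(10.0, 10.0), (10.5, 10.0)]
points = grid + small_group

print(cluster_outliers(points))
```

Here the 16 grid objects form one large cluster, while the two remote objects form a cluster of size 2 and are therefore reported as outliers.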
 