CBLOF defines the similarity between a point and a cluster in a statistical way that
represents the probability that the point belongs to the cluster. The larger the value, the
more similar the point and the cluster are. The CBLOF score can detect outlier points
that are far from any clusters. In addition, small clusters that are far from any large
cluster are considered to consist of outliers.
The points with the lowest CBLOF scores are suspected outliers.
Example 12.18 Detecting outliers in small clusters. The data points in Figure 12.12 form three clusters:
large clusters, C1 and C2, and a small cluster, C3. Object o does not belong to any cluster.
Using CBLOF, FindCBLOF can identify o as well as the points in cluster C3 as outliers.
For o, the closest large cluster is C1. The CBLOF is simply the similarity between o and
C1, which is small. For the points in C3, the closest large cluster is C2. Although there
are three points in cluster C3, the similarity between those points and cluster C2 is low,
and |C3| = 3 is small; thus, the CBLOF scores of points in C3 are small.
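The scoring in this example can be sketched in Python. This is a minimal illustration under stated assumptions, not the FindCBLOF implementation: the similarity function 1 / (1 + Euclidean distance to the centroid) is a stand-in for the statistical similarity described above, the score of a clustered point is assumed to be the cluster size times the similarity (as the appeal to |C3| suggests), and the names cblof_scores and min_large_size, the size threshold, and the coordinates are all hypothetical.

```python
import math

def centroid(points):
    # Component-wise mean of a list of equal-length tuples.
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def similarity(point, center):
    # Assumed stand-in for the statistical similarity: decays with distance.
    return 1.0 / (1.0 + math.dist(point, center))

def cblof_scores(clusters, unassigned, min_large_size):
    """clusters: dict name -> list of points; unassigned: points in no cluster."""
    centers = {name: centroid(pts) for name, pts in clusters.items()}
    # Assumption: a cluster is "large" if it has at least min_large_size points.
    large = [name for name, pts in clusters.items() if len(pts) >= min_large_size]

    def closest_large_similarity(p):
        return max(similarity(p, centers[name]) for name in large)

    scores = {}
    for name, pts in clusters.items():
        for p in pts:
            if name in large:
                sim = similarity(p, centers[name])   # own (large) cluster
            else:
                sim = closest_large_similarity(p)    # closest large cluster
            scores[p] = len(pts) * sim               # assumed: size times similarity
    for p in unassigned:
        # A point in no cluster: just the similarity to the closest large cluster.
        scores[p] = closest_large_similarity(p)
    return scores

# Made-up data mirroring the example: two large clusters, one small, one stray object.
C1 = [(0.5 * x, 0.5 * y) for x in range(4) for y in range(3)]
C2 = [(10 + 0.5 * x, 10 + 0.5 * y) for x in range(4) for y in range(3)]
C3 = [(16.0, 10.0), (16.5, 10.0), (16.0, 10.5)]
o = (3.0, -4.0)

scores = cblof_scores({"C1": C1, "C2": C2, "C3": C3}, [o], min_large_size=5)
suspects = sorted(scores, key=scores.get)[:4]  # the four lowest-scoring points
```

With this data, the four lowest scores belong to o and the three points of C3, matching the outcome described in the example.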
Clustering-based approaches may incur high computational costs if they have to find
clusters before detecting outliers. Several techniques have been developed for improved
efficiency. For example, fixed-width clustering is a linear-time technique that is used in
some outlier detection methods. The idea is simple yet efficient. A point is assigned to
a cluster if the center of the cluster is within a predefined distance threshold from the
point. If a point cannot be assigned to any existing cluster, a new cluster is created. The
distance threshold may be learned from the training data under certain conditions.
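The assignment rule just described can be sketched as a single pass over the data. This is a minimal sketch under assumptions the text does not settle: each cluster center is taken to be the first point that created the cluster and is never updated, a point joins the first cluster whose center is within the threshold, and the function name and sample points are hypothetical.

```python
import math

def fixed_width_clustering(points, width):
    """Single-pass clustering: each point is compared only with cluster centers."""
    centers = []   # one center per cluster (assumed fixed at the founding point)
    members = []   # members[i] holds the points assigned to centers[i]
    for p in points:
        for i, c in enumerate(centers):
            if math.dist(p, c) <= width:
                members[i].append(p)   # center lies within the threshold
                break
        else:
            # No existing cluster is close enough: start a new one at p.
            centers.append(p)
            members.append([p])
    return centers, members

points = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.0, 10.4), (0.0, 0.3), (30.0, 30.0)]
centers, members = fixed_width_clustering(points, width=1.0)
sizes = [len(m) for m in members]
```

Each point is compared with the current centers only, so the cost grows linearly with the number of points when the number of clusters stays small. A detector built on this could treat clusters with very few members, such as the singleton around (30.0, 30.0) here, as outlier candidates.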
Clustering-based outlier detection methods have the following advantages. First, they
can detect outliers without requiring any labeled data, that is, in an unsupervised way.
They work for many data types. Second, clusters can be regarded as summaries of the data.
Once the clusters are obtained, clustering-based methods need only compare any object
against the clusters to determine whether the object is an outlier. This process is typically
fast because the number of clusters is usually small compared to the total number of
objects.
Figure 12.12 Outliers in small clusters (the figure labels clusters C1, C2, and C3 and object o).
 