Outlier Detection - Data Mining: Concepts and Techniques - page 570


CBLOF defines the similarity between a point and a cluster in a statistical way that
represents the probability that the point belongs to the cluster. The larger the value, the
more similar the point and the cluster are. The CBLOF score can detect outlier points
that are far from any cluster. In addition, small clusters that are far from any large
cluster are considered to consist of outliers. The points with the lowest CBLOF scores
are suspected outliers.

Example 12.18 Detecting outliers in small clusters. The data points in Figure 12.12 form three clusters:
large clusters, C1 and C2, and a small cluster, C3. Object o does not belong to any cluster.
Using CBLOF, FindCBLOF can identify o as well as the points in cluster C3 as outliers.
For o, the closest large cluster is C1. The CBLOF is simply the similarity between o and
C1, which is small. For the points in C3, the closest large cluster is C2. Although there
are three points in cluster C3, the similarity between those points and cluster C2 is low,
and |C3| = 3 is small; thus, the CBLOF scores of points in C3 are small.
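
The scoring in Example 12.18 can be sketched in a few lines of Python. This is a rough illustration, not the FindCBLOF algorithm itself. It assumes the usual CBLOF definition (cluster size times similarity to the point's own cluster for large clusters, and cluster size times similarity to the closest large cluster for small ones). The similarity function 1 / (1 + distance to centroid), the min_large_size cutoff, and all coordinates are hypothetical choices made for the example. Object o is treated as a cluster of size 1.

```python
import math

def centroid(points):
    """Mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def similarity(point, cluster):
    # Assumed stand-in for the statistical similarity in the text:
    # close to 1 near the cluster centroid, approaching 0 far away.
    return 1.0 / (1.0 + math.dist(point, centroid(cluster)))

def cblof_scores(clusters, min_large_size):
    """Return {point: score}; min_large_size is a hypothetical cutoff
    separating large clusters from small ones."""
    large = [c for c in clusters if len(c) >= min_large_size]
    scores = {}
    for c in clusters:
        for p in c:
            if len(c) >= min_large_size:
                sim = similarity(p, c)  # similarity to own cluster
            else:
                # similarity to the closest (most similar) large cluster
                sim = max(similarity(p, big) for big in large)
            scores[p] = len(c) * sim
    return scores

# Invented coordinates that loosely mimic the layout of the example.
C1 = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5)]
C2 = [(10, 0), (10, 1), (11, 0), (11, 1), (10.5, 0.5)]
C3 = [(10, 6), (10.5, 6), (10, 6.5)]
o = (-4, 4)

scores = cblof_scores([C1, C2, C3, [o]], min_large_size=5)
suspects = sorted(scores, key=scores.get)[:4]  # four lowest scores
print(suspects)
```

With these made-up points, the four lowest scores belong to o and the three points of C3, which matches the reasoning in the example: low similarity to the nearest large cluster combined with a small cluster size.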

Clustering-based approaches may incur high computational costs if they have to find
clusters before detecting outliers. Several techniques have been developed for improved
efficiency. For example, fixed-width clustering is a linear-time technique that is used in
some outlier detection methods. The idea is simple yet efficient. A point is assigned to
a cluster if the center of the cluster is within a predefined distance threshold from the
point. If a point cannot be assigned to any existing cluster, a new cluster is created. The
distance threshold may be learned from the training data under certain conditions.
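
A minimal sketch of fixed-width clustering, under a few assumptions the text does not spell out: the first point of each cluster serves as its fixed center, a point joins the first cluster whose center is within the threshold, and the data are scanned once. The function name and the parameter width are illustrative.

```python
import math

def fixed_width_clustering(points, width):
    """Single pass over the data; returns (centers, members)."""
    centers = []  # assumed: a cluster's center is its first point
    members = []  # members[i] holds the points assigned to centers[i]
    for p in points:
        for i, c in enumerate(centers):
            if math.dist(p, c) <= width:
                members[i].append(p)  # center is within the threshold
                break
        else:
            # no existing cluster is close enough: create a new one
            centers.append(p)
            members.append([p])
    return centers, members

points = [(0, 0), (0.5, 0), (10, 10), (10.2, 9.9), (0.1, 0.3), (50, 50)]
centers, members = fixed_width_clustering(points, width=1.0)
print(centers)
print([len(m) for m in members])
```

Each point is compared only against the current cluster centers, so the cost per point depends on the number of clusters rather than on the number of points seen so far. Clusters that end up with very few members, such as the one around (50, 50) here, are natural outlier candidates.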

Clustering-based outlier detection methods have the following advantages. First, they
can detect outliers without requiring any labeled data, that is, in an unsupervised way.
They work for many data types. Second, clusters can be regarded as summaries of the data.
Once the clusters are obtained, clustering-based methods need only compare any object
against the clusters to determine whether the object is an outlier. This process is typically
fast because the number of clusters is usually small compared to the total number of
objects.
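
To illustrate why the comparison step is cheap, the sketch below checks a new object against cluster centers only. The distance-to-center rule, the function name, and the threshold value are assumptions made for the example; an actual method may use a different test.

```python
import math

def is_outlier(obj, centers, threshold):
    """Flag obj when no cluster center lies within threshold of it.
    Only len(centers) distance computations are needed per object."""
    return all(math.dist(obj, c) > threshold for c in centers)

centers = [(0, 0), (10, 10)]  # hypothetical cluster summaries
print(is_outlier((0.5, 0.5), centers, threshold=2.0))
print(is_outlier((5, 5), centers, threshold=2.0))
```

The work per object grows with the number of clusters, not with the size of the original data set.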

Figure 12.12 Outliers in small clusters. (The figure shows large clusters C1 and C2, small cluster C3, and object o.)
