Cluster Analysis: Basic Concepts and Methods - Data Mining: Concepts and Techniques

and

$$b(o) = \min_{C_j : 1 \le j \le k,\ j \ne i} \left\{ \frac{\sum_{o' \in C_j} \mathit{dist}(o, o')}{|C_j|} \right\}. \tag{10.32}$$

The silhouette coefficient of o is then defined as

$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}. \tag{10.33}$$

The value of the silhouette coefficient is between −1 and 1. The value of a(o) reflects the compactness of the cluster to which o belongs. The smaller the value, the more compact the cluster. The value of b(o) captures the degree to which o is separated from other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when the silhouette coefficient value of o approaches 1, the cluster containing o is compact and o is far away from other clusters, which is the preferable case. However, when the silhouette coefficient value is negative (i.e., b(o) < a(o)), this means that, in expectation, o is closer to the objects in another cluster than to the objects in the same cluster as o. In many cases, this is a bad situation and should be avoided.
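As a minimal sketch of Eqs. (10.32) and (10.33), the following Python function computes a(o), b(o), and s(o) for a single object. It rests on several assumptions that are not stated in this excerpt: a(o) is taken to be the average distance from o to the other objects of its own cluster, dist is the Euclidean distance, and an object whose value is undefined (a singleton cluster, or a clustering with only one cluster) is given s(o) = 0, which is a common convention. The function name `silhouette_parts` and the sample data are hypothetical.

```python
import math

def silhouette_parts(o_idx, points, labels):
    """Return (a, b, s) for the object at index o_idx.

    Assumes Euclidean distance; returns (0.0, 0.0, 0.0) when s(o) is
    undefined (singleton cluster or only one cluster), by convention.
    """
    o, own = points[o_idx], labels[o_idx]

    # Collect distances from o to every other object, grouped by cluster.
    dists = {}
    for idx, (p, lab) in enumerate(zip(points, labels)):
        if idx != o_idx:
            dists.setdefault(lab, []).append(math.dist(o, p))

    if own not in dists or len(dists) < 2:
        return 0.0, 0.0, 0.0

    # a(o): average distance to the other objects in o's own cluster.
    a = sum(dists[own]) / len(dists[own])
    # b(o): smallest average distance to any cluster o does not belong to.
    b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
    # s(o) = (b - a) / max{a, b}, guarding against a zero denominator.
    denom = max(a, b)
    s = (b - a) / denom if denom > 0 else 0.0
    return a, b, s

points = [(0, 0), (0, 1), (5, 5), (5, 6)]

# Well-separated clustering: s(o) for the first object is close to 1.
a_good, b_good, s_good = silhouette_parts(0, points, [0, 0, 1, 1])

# Poor clustering: the first object shares a cluster with a distant one,
# so b(o) < a(o) and s(o) is negative.
a_bad, b_bad, s_bad = silhouette_parts(0, points, [0, 1, 1, 0])
```

In the first labeling the object at (0, 0) sits next to its only cluster mate, so a(o) is small relative to b(o). In the second labeling the situation is reversed, which is the case the text describes as one to avoid.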

To measure a cluster's fitness within a clustering, we can compute the average silhouette coefficient value of all objects in the cluster. To measure the quality of a clustering, we can use the average silhouette coefficient value of all objects in the data set. The silhouette coefficient and other intrinsic measures can also be used in the elbow method to heuristically derive the number of clusters in a data set by replacing the sum of within-cluster variances.
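The two averages described above can be sketched as follows. This is an illustrative example under stated assumptions, not a definitive implementation: the distance is Euclidean, objects with an undefined silhouette value score 0, and the candidate labelings for k = 2 and k = 3 are written by hand purely for illustration. In practice they would come from running a clustering algorithm once for each candidate k. All function names are hypothetical.

```python
import math
from collections import defaultdict

def silhouette_values(points, labels):
    """Return s(o) for every object, using Euclidean distance."""
    values = []
    for i, o in enumerate(points):
        by_cluster = defaultdict(list)
        for j, p in enumerate(points):
            if j != i:
                by_cluster[labels[j]].append(math.dist(o, p))
        own = by_cluster.pop(labels[i], None)
        if not own or not by_cluster:
            values.append(0.0)  # assumption: undefined cases score 0
            continue
        a = sum(own) / len(own)
        b = min(sum(d) / len(d) for d in by_cluster.values())
        values.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return values

def cluster_fitness(points, labels):
    """Average s(o) over the objects of each cluster."""
    groups = defaultdict(list)
    for lab, s in zip(labels, silhouette_values(points, labels)):
        groups[lab].append(s)
    return {lab: sum(v) / len(v) for lab, v in groups.items()}

def clustering_quality(points, labels):
    """Average s(o) over all objects in the data set."""
    s = silhouette_values(points, labels)
    return sum(s) / len(s)

points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]

# Hand-written candidate labelings, keyed by the number of clusters k.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],  # splits one tight group unnecessarily
}

fitness = {k: cluster_fitness(points, labs) for k, labs in candidates.items()}
scores = {k: clustering_quality(points, labs) for k, labs in candidates.items()}

# Pick the k whose clustering has the highest average silhouette value.
best_k = max(scores, key=scores.get)
```

Because a higher average silhouette value indicates a better clustering, the candidate k is chosen here by maximizing the score, whereas the sum of within-cluster variances is read by looking for the turning point of a decreasing curve.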


10.7 Summary

A cluster is a collection of data objects that are similar to one another within the same

cluster and are dissimilar to the objects in other clusters. The process of grouping a

set of physical or abstract objects into classes of similar objects is called clustering.

Cluster analysis has extensive applications, including business intelligence, image

pattern recognition, Web search, biology, and security. Cluster analysis can be used

as a standalone data mining tool to gain insight into the data distribution, or as

a preprocessing step for other data mining algorithms operating on the detected

clusters.

Clustering is a dynamic field of research in data mining. It is related to unsupervised

learning in machine learning.

Clustering is a challenging field. Its typical requirements include scalability, the

ability to deal with different types of data and attributes, the discovery of clusters

of arbitrary shape, minimal requirements for domain knowledge to determine

input parameters, the ability to deal with noisy data, incremental clustering and
