and
$$
b(o) = \min_{C_j :\, 1 \le j \le k,\; j \ne i} \left\{ \frac{\sum_{o' \in C_j} \mathit{dist}(o, o')}{|C_j|} \right\}. \tag{10.32}
$$
The silhouette coefficient of o is then defined as
$$
s(o) = \frac{b(o) - a(o)}{\max\{a(o),\, b(o)\}}. \tag{10.33}
$$
The value of the silhouette coefficient is between -1 and 1. The value of a(o) reflects
the compactness of the cluster to which o belongs. The smaller the value, the more
compact the cluster. The value of b(o) captures the degree to which o is separated from
other clusters. The larger b(o) is, the more separated o is from other clusters. Therefore,
when the silhouette coefficient value of o approaches 1, the cluster containing o is
compact and o is far away from other clusters, which is the preferable case. However, when
the silhouette coefficient value is negative (i.e., b(o) < a(o)), this means that, in
expectation, o is closer to the objects in another cluster than to the objects in the same
cluster as o. In many cases, this is a bad situation and should be avoided.
To measure a cluster's fitness within a clustering, we can compute the average silhouette
coefficient value of all objects in the cluster. To measure the quality of a clustering,
we can use the average silhouette coefficient value of all objects in the data set. The
silhouette coefficient and other intrinsic measures can also be used in the elbow method
to heuristically derive the number of clusters in a data set by replacing the sum of
within-cluster variances.
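The silhouette computation described above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes Euclidean distance for dist, takes a(o) to be the mean distance from o to the other objects in its own cluster, and assumes there are at least two clusters, each with at least two objects. The function name silhouette_coefficients and the toy data are hypothetical.

```python
import math
from collections import defaultdict


def silhouette_coefficients(points, labels):
    """Return s(o) for every object, in input order.

    Assumes Euclidean distance, at least two clusters, and at least
    two objects per cluster (otherwise a(o) is undefined here).
    """
    # Group object indices by cluster label.
    clusters = defaultdict(list)
    for idx, label in enumerate(labels):
        clusters[label].append(idx)

    scores = []
    for i, label in enumerate(labels):
        # a(o): mean distance from o to the other objects in its own cluster.
        own = [j for j in clusters[label] if j != i]
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)

        # b(o): smallest mean distance from o to the objects of any
        # other cluster, as in Eq. (10.32).
        b = min(
            sum(math.dist(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )

        # s(o) as in Eq. (10.33).
        scores.append((b - a) / max(a, b))
    return scores


# Toy data: two tight pairs of points, five units apart.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]

good = silhouette_coefficients(points, [0, 0, 1, 1])  # pairs kept together
bad = silhouette_coefficients(points, [0, 1, 0, 1])   # pairs split apart

print(sum(good) / len(good))  # average silhouette, about 0.80
print(sum(bad) / len(bad))    # average silhouette, about -0.39
```

Averaging the returned scores over one cluster gives that cluster's fitness, and averaging over all objects gives the quality of the whole clustering. On the toy data, the clustering that keeps nearby points together scores close to 1, whereas the one that splits them scores negative, matching the interpretation given above.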
10.7 Summary
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters. The process of grouping a
set of physical or abstract objects into classes of similar objects is called clustering.
Cluster analysis has extensive applications, including business intelligence, image
pattern recognition, Web search, biology, and security. Cluster analysis can be used
as a standalone data mining tool to gain insight into the data distribution, or as
a preprocessing step for other data mining algorithms operating on the detected
clusters.
Clustering is a dynamic field of research in data mining. It is related to unsupervised
learning in machine learning.
Clustering is a challenging field. Typical requirements include scalability, the
ability to deal with different types of data and attributes, the discovery of clusters
of arbitrary shape, minimal requirements for domain knowledge to determine
input parameters, the ability to deal with noisy data, incremental clustering and
 