and

$$b(o) = \min_{C_j:\, 1 \le j \le k,\; j \ne i} \left\{ \frac{\sum_{o' \in C_j} \mathrm{dist}(o, o')}{|C_j|} \right\}. \tag{10.32}$$
The silhouette coefficient of o is then defined as

$$s(o) = \frac{b(o) - a(o)}{\max\{a(o), b(o)\}}. \tag{10.33}$$
The value of the silhouette coefficient is between -1 and 1. The value of a(o) reflects
the compactness of the cluster to which o belongs. The smaller the value, the more compact
the cluster. The value of b(o) captures the degree to which o is separated from other
clusters. The larger b(o) is, the more separated o is from other clusters. Therefore, when
the silhouette coefficient value of o approaches 1, the cluster containing o is compact
and o is far away from other clusters, which is the preferable case. However, when the
silhouette coefficient value is negative (i.e., b(o) < a(o)), this means that, in expectation,
o is closer to the objects in another cluster than to the objects in the same cluster as o.
In many cases, this is a bad situation and should be avoided.
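The definitions of a(o), b(o), and s(o) can be illustrated with a minimal Python sketch. It assumes Euclidean distance, takes a(o) as the average distance from o to the other objects in its own cluster, and returns 0 when o is alone in its cluster or when only one cluster exists (a convention chosen here, not one stated in the text). The function name `silhouette` and the toy data are hypothetical.

```python
from math import dist  # Euclidean distance, available in Python 3.8+


def silhouette(points, labels, i):
    """Silhouette coefficient s(o) of the object at index i."""
    o, own = points[i], labels[i]

    # Collect the distances from o to every other object, grouped by cluster.
    by_cluster = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            by_cluster.setdefault(lab, []).append(dist(o, p))

    same = by_cluster.pop(own, [])
    if not same or not by_cluster:
        return 0.0  # assumed convention: singleton cluster or only one cluster

    a = sum(same) / len(same)                               # compactness, a(o)
    b = min(sum(d) / len(d) for d in by_cluster.values())   # separation, b(o)
    denom = max(a, b)
    return (b - a) / denom if denom > 0 else 0.0


# Toy data: two tight groups of two points each.
points = [(0, 0), (0, 1), (5, 5), (5, 6)]

# Every object sits in its natural group: s(o) is close to 1.
well_placed = silhouette(points, [0, 0, 1, 1], 0)

# Object (0, 1) is assigned to the far group: b(o) < a(o), so s(o) is negative.
misplaced = silhouette(points, [0, 1, 1, 1], 1)
```

In this toy setup the well-placed object scores near 1 and the misassigned object scores below 0, matching the interpretation given above.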
To measure a cluster's fitness within a clustering, we can compute the average silhou-
ette coefficient value of all objects in the cluster. To measure the quality of a clustering,
we can use the average silhouette coefficient value of all objects in the data set. The sil-
houette coefficient and other intrinsic measures can also be used in the elbow method
to heuristically derive the number of clusters in a data set by replacing the sum of
within-cluster variances.
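As a rough illustration of these averages, the sketch below computes the silhouette coefficient of every object, averages it per cluster (cluster fitness) and over the whole data set (clustering quality), and compares two candidate clusterings of the same toy data. It assumes Euclidean distance and scores singleton clusters as 0. The helper names, the data, and the candidate labelings are hypothetical, and picking the candidate with the highest average is only one simple heuristic, not the elbow method itself.

```python
from math import dist
from statistics import mean


def silhouettes(points, labels):
    """Return s(o) for every object, using Euclidean distance."""
    scores = []
    for i, o in enumerate(points):
        per_cluster = {}
        for j, p in enumerate(points):
            if j != i:
                per_cluster.setdefault(labels[j], []).append(dist(o, p))
        same = per_cluster.pop(labels[i], None)
        if not same or not per_cluster:
            scores.append(0.0)  # assumed convention for degenerate cases
            continue
        a = mean(same)
        b = min(mean(d) for d in per_cluster.values())
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return scores


def cluster_fitness(scores, labels):
    """Average silhouette coefficient of the objects in each cluster."""
    groups = {}
    for s, lab in zip(scores, labels):
        groups.setdefault(lab, []).append(s)
    return {lab: mean(vals) for lab, vals in groups.items()}


points = [(0, 0), (0, 1), (1, 0), (8, 8), (8, 9), (9, 8)]

# Hypothetical candidate clusterings with k = 2 and k = 3 clusters.
candidates = {
    2: [0, 0, 0, 1, 1, 1],
    3: [0, 0, 2, 1, 1, 1],
}

# Clustering quality: average silhouette over all objects, per candidate k.
quality = {k: mean(silhouettes(points, labs)) for k, labs in candidates.items()}
best_k = max(quality, key=quality.get)

# Per-cluster fitness for the chosen clustering.
fitness = cluster_fitness(silhouettes(points, candidates[best_k]), candidates[best_k])
```

Here the two-cluster labeling scores higher because splitting one tight group lowers the silhouette values of its members.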
10.7 Summary
A cluster is a collection of data objects that are similar to one another within the same
cluster and are dissimilar to the objects in other clusters. The process of grouping a
set of physical or abstract objects into classes of similar objects is called clustering.
Cluster analysis has extensive applications, including business intelligence, image
pattern recognition, Web search, biology, and security. Cluster analysis can be used
as a standalone data mining tool to gain insight into the data distribution, or as
a preprocessing step for other data mining algorithms operating on the detected
clusters.
Clustering is a dynamic field of research in data mining. It is related to unsupervised
learning in machine learning.
Clustering is a challenging field. Typical requirements of it include scalability, the
ability to deal with different types of data and attributes, the discovery of clusters
in arbitrary shape, minimal requirements for domain knowledge to determine
input parameters, the ability to deal with noisy data, incremental clustering and