Information Technology Reference
In-Depth Information
Minkowski distances computes the distance of documents d and d
by Equation 21
(for n = 2 it is converted to Euclidean distance).
1
/
n
t
(
)
n
=
D
d
,
d
=
d
d
(21)
n
i
i
i
1
d T
d
The cosine measure is defined by Equation 22 where
is the inner product
(dot-product) of two vectors.
T
d
d
(
)
cos
d
,
d
=
(22)
d
d
Both metrics are widely used in the text document clustering literatures. It seems,
however, in the cases when the number of dimensions of two vectors differs largely,
the cosine is more useful. In cases where two vectors have almost the same dimen-
sion, Minkowski distance can be useful.
As mentioned above, the objective of a clustering algorithm is to maximize the in-
tra-similarity and minimize the inter-similarity. The cohesiveness of clusters can be
used as a measure of cluster similarity. The quality of a specific clustering can be de-
termined by Average Distance of Documents to the Cluster Centroid (ADDC) repre-
sented by that clustering. This value is measured by the equation:
n
(
)
i
D
c
,
d
i
ij
k
j
=
1
n
i
=
1
i
f
=
(23)
K
where K is the number of clusters;
n is the numbers of documents in cluster i (e.g.,
); D is distance function, i d is the j th document of cluster i ,
and the c in the centroid of i th cluster which is the average of all document vectors in
the cluster.
Since our goal is to optimize an objective function, clustering is essentially a
search problem. Let us stress that dividing n data into K clusters gives rise to a huge
number of possible partitions, which is expressed in the form of the Stirling number:
n
=
a
(
j
n
)
i
ij
K
K
1
()
=
K
i
n
1
i
(24)
K
!
i
i
1
This illustrates that the clustering by examining all possible partitions of n docu-
ments of t -dimensions into K clusters is not computationally feasible. Obviously, we
need to resort to some optimization techniques to reduce the search space, but then
there is no guarantee that the optimal solution will be found. Methods such as GA [36,
41], self-organizing maps (SOM) [42] and ant clustering [43] have been used for
 
Search WWH ::




Custom Search