This degree of similarity is defined by a similarity or dissimilarity measure,
where measure is allowed to be interpreted here in a lax sense, as some
monotonic set function $d(C_i, C_j)$ whose minimum value is $d(C_i, C_i)$ for any
$C_i$.
The most common dissimilarity measure between two real-valued vectors
$\mathbf{x}$ and $\mathbf{y}$ is the weighted $L_p$ metric,

$$d_p(\mathbf{x}, \mathbf{y}) = \left( \sum_{i=1}^{d} w_i \, |x_i - y_i|^p \right)^{1/p}, \tag{6.50}$$

where $x_i$ and $y_i$ are the $i$th coordinates of $\mathbf{x}$ and $\mathbf{y}$, $i = 1, \ldots, d$, and
$w_i \geq 0$ is the $i$th weight coefficient. The unweighted ($w_i = 1$) $L_p$ metric is
also known as the Minkowski distance of order $p$ ($p \geq 1$). Examples of this
distance are the well-known Euclidean distance (the most common dissimilarity
measure used by clustering algorithms), obtained by setting $p = 2$; the
Manhattan distance, for $p = 1$; and the $L_\infty$ or Chebyshev distance. The
LEGClust algorithm also uses a dissimilarity measure, although defined in an
unconventional way. Dissimilarities between objects $\mathbf{x}_i$ and $\mathbf{x}_j$, for all
objects represented by a set of vectors $\{\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n\}$, $\mathbf{x}_i \in \mathbb{R}^d$,
are conveniently arranged in a dissimilarity matrix $A \in \mathbb{R}^{n \times n}$, where each
element of $A$ is $a_{i,j} = d(\mathbf{x}_i, \mathbf{x}_j)$, with $d(\mathbf{x}_i, \mathbf{x}_j)$ the dissimilarity
between $\mathbf{x}_i$ and $\mathbf{x}_j$ (strictly speaking, $d(\{\mathbf{x}_i\}, \{\mathbf{x}_j\})$ for the
singleton sets $\{\mathbf{x}_i\}$ and $\{\mathbf{x}_j\}$).
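
As an illustration of the weighted $L_p$ metric and of the dissimilarity matrix $A$ described above, the following is a minimal Python sketch, not taken from the book. The function names `weighted_lp` and `dissimilarity_matrix` are hypothetical, and the treatment of $p = \infty$ (maximum coordinate difference over coordinates with positive weight) is an assumption based on the limit of the weighted sum as $p$ grows.

```python
import math
from typing import List, Optional, Sequence


def weighted_lp(x: Sequence[float], y: Sequence[float], p: float = 2.0,
                w: Optional[Sequence[float]] = None) -> float:
    """Weighted L_p dissimilarity between two vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same dimension")
    if w is None:
        w = [1.0] * len(x)  # unweighted case: Minkowski distance of order p
    if len(w) != len(x):
        raise ValueError("w must have one weight per coordinate")
    if p < 1 or any(wi < 0 for wi in w):
        raise ValueError("requires p >= 1 and non-negative weights")

    diffs = [abs(xi - yi) for xi, yi in zip(x, y)]
    if math.isinf(p):
        # Assumed limit p -> infinity (Chebyshev): largest coordinate
        # difference among coordinates whose weight is positive.
        return max((d for d, wi in zip(diffs, w) if wi > 0), default=0.0)
    return sum(wi * d ** p for wi, d in zip(w, diffs)) ** (1.0 / p)


def dissimilarity_matrix(points: Sequence[Sequence[float]], p: float = 2.0,
                         w: Optional[Sequence[float]] = None) -> List[List[float]]:
    """n x n matrix A with a[i][j] = d(points[i], points[j])."""
    n = len(points)
    return [[weighted_lp(points[i], points[j], p, w) for j in range(n)]
            for i in range(n)]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
    for row in dissimilarity_matrix(pts, p=2.0):  # Euclidean case
        print(row)
```

For the points (0, 0) and (3, 4), the Euclidean ($p = 2$) dissimilarity is 5 and the Manhattan ($p = 1$) dissimilarity is 7. The sketch computes every entry of $A$ directly; since these metrics are symmetric with a zero diagonal, an implementation concerned with cost could compute only the upper triangle.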
6.4.4 The LEGClust Algorithm
As mentioned earlier, clustering solutions may vary widely with the algorithm
being used and, for the same algorithm, with its specific settings. People also
cluster data differently according to their knowledge, perspective or experience.
In [201] some clustering tests were performed involving several types of
individuals in order to try to understand the mental process of data clustering.
The tests used two-dimensional datasets similar to those to be presented
in Sect. 6.4.5. Figure 6.20 shows one such dataset with different clustering
solutions suggested by different individuals.
The most important conclusion presented in [201] was that human clustering
exhibits some balance between the importance given to local (e.g.,
connectedness) and global (e.g., structuring direction) features of the data.
The tests also provided majority choices of clustering solutions that one can
use to compare the results of different clustering algorithms.
The following sections describe the LEGClust algorithm, first presented
in [204]. We first introduce the entropic dissimilarity matrix and, based on
that, the computation of the so-called layered entropic proximity matrix.
 