for the square root of the Manhattan distance,
$$f_k(x_{ik}, x_{jk}) = -\tfrac{1}{2}\,|x_{ik} - x_{jk}|; \tag{9.3}$$
and for Clark's distance,
$$f_k(x_{ik}, x_{jk}) = -\tfrac{1}{2}\left(\frac{x_{ik} - x_{jk}}{x_{ik} + x_{jk}}\right)^{2}. \tag{9.4}$$
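The per-variable terms can be written down directly. The following is a minimal Python sketch, not the authors' code. The function names are hypothetical, and the form used for the Pythagorean term (9.2), $-\tfrac{1}{2}(x_{ik} - x_{jk})^2$, is an assumption, since that equation is not reproduced in this section. The last lines multiply both values by a common factor to show how each term responds to a change of units.

```python
import numpy as np

def pythagorean(x_i, x_j):
    """Assumed form of (9.2): -1/2 (x_ik - x_jk)^2."""
    return -0.5 * (x_i - x_j) ** 2

def sqrt_manhattan(x_i, x_j):
    """Term (9.3): -1/2 |x_ik - x_jk|."""
    return -0.5 * np.abs(x_i - x_j)

def clark(x_i, x_j):
    """Term (9.4): -1/2 ((x_ik - x_jk) / (x_ik + x_jk))^2.
    Assumes x_ik + x_jk != 0, e.g. strictly positive data."""
    return -0.5 * ((x_i - x_j) / (x_i + x_j)) ** 2

x_i, x_j = 2.0, 6.0
for f in (pythagorean, sqrt_manhattan, clark):
    # Compare each term in the original units and after multiplying by 10.
    print(f.__name__, f(x_i, x_j), f(10 * x_i, 10 * x_j))
```

In this sketch the Pythagorean and Manhattan terms change when the units change, while the Clark term, being a function of a ratio, does not.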
All these distances are defined in terms of differences between two values of the same
variable, so there is no need to work in terms of deviations from the mean. However,
(9.2) and (9.3) depend on the scaling used, which, as explained in Sections 2.5 and 3.6,
makes it vital to use some form of normalization when combining measurements from
variables measured on incommensurable scales. Gower (1992) and Gower and Hand
(1996) consider scaling each variable to have unit sum of squares or unit range. In the
following, we assume that all quantitative variables have been prescaled to correct for
incommensurability. In our first example, we have scaled to unit range with the scaled
value $\tilde{x}_{ik}$ for $i = 1, 2, \ldots, n$ and $k = 1, 2, \ldots, p$ given by
$$\tilde{x}_{ik} = \frac{x_{ik}}{\max_i(x_{ik}) - \min_i(x_{ik})}. \tag{9.5}$$
Then we have used Pythagorean distance.
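As an illustration of these two steps, the sketch below scales each column to unit range as in (9.5) and then forms the matrix with entries $-\tfrac{1}{2}\sum_k (\tilde{x}_{ik} - \tilde{x}_{jk})^2$. The function names and the small data matrix are hypothetical, and the $-\tfrac{1}{2}$ convention for the Pythagorean term is assumed to match (9.3) and (9.4).

```python
import numpy as np

def scale_to_unit_range(X):
    """Divide each column by its range, max_i(x_ik) - min_i(x_ik), as in (9.5)."""
    X = np.asarray(X, dtype=float)
    col_range = X.max(axis=0) - X.min(axis=0)
    if np.any(col_range == 0):
        raise ValueError("a constant column has zero range and cannot be scaled")
    return X / col_range

def pythagorean_ddistance(X):
    """Return the n x n matrix with entries -1/2 * sum_k (x_ik - x_jk)^2."""
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences, shape (n, n, p)
    return -0.5 * (diff ** 2).sum(axis=2)

# Hypothetical data: two variables measured on very different scales.
X = np.array([[1.0, 100.0],
              [3.0, 300.0],
              [5.0, 200.0]])
X_scaled = scale_to_unit_range(X)
D = pythagorean_ddistance(X_scaled)
print(D)
```

After scaling, every column has range 1, so neither variable dominates the sum merely because of its units. Subtracting the column minimum as well would leave the result unchanged, because only differences within a column enter the distance.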
Due to the assumption of additive distance, it can be assumed without loss of generality
that the variables are ordered such that the first $p^{(1)}$ are continuous and the remaining
$p^{(2)}$ are categorical, with $p^{(1)} + p^{(2)} = p$ and
$$D = \sum_{k=1}^{p^{(1)}} D_k + \sum_{k=p^{(1)}+1}^{p} D_k = D^{(1)} + D^{(2)}, \tag{9.6}$$
where $D_k = \{f_k(x_{ik}, x_{jk})\}$, the ddistance matrix derived solely for the $k$th variable.
The matrix $D^{(1)}$ is calculated as before, using an additive Euclidean embeddable
distance measure on the $p^{(1)}$ continuous variables. If the $k$th variable is categorical, an
indicator matrix (see Chapter 8) $G_k : n \times L_k$ is formed with $L_k$ the number of category
levels for this variable. Each row of $G_k$ represents a sample such that
$$g_{ih} = \begin{cases} 1 & \text{if the $i$th observation on variable $k$ falls into category level $h$,} \\ 0 & \text{otherwise.} \end{cases}$$
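The indicator coding just described can be sketched in a few lines of Python. This is an illustrative construction only: the function name and the example variable are hypothetical, and ordering the category levels alphabetically (the default behaviour of `np.unique`) is an assumption, since the text does not fix an ordering.

```python
import numpy as np

def indicator_matrix(values):
    """Return (G_k, levels), where G_k is n x L_k with g_ih = 1 when
    observation i falls into category level h and 0 otherwise."""
    values = np.asarray(values).ravel()
    # levels: sorted distinct categories; codes: level index of each observation
    levels, codes = np.unique(values, return_inverse=True)
    G = np.zeros((values.size, levels.size), dtype=int)
    G[np.arange(values.size), codes] = 1
    return G, levels

# Hypothetical categorical variable with n = 4 observations and L_k = 3 levels.
colour = ["red", "green", "red", "blue"]
G_k, levels = indicator_matrix(colour)
print(levels)
print(G_k)
```

Each row of the result contains exactly one 1, in the column of the level that the observation takes.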
To calculate $D^{(2)}$ the matrix
$$G : n \times L = \begin{bmatrix} G_{p^{(1)}+1} & \cdots & G_p \end{bmatrix}$$
is formed, where $L = L_{p^{(1)}+1} + \cdots + L_p$. If the $L$ columns of $G$ are, or are viewed
as, dichotomous variables, an obvious approach would be to derive $D^{(2)}$ as a matrix of
dissimilarities. However, coding multilevel categories as a series of dichotomous variables
leads to a situation where the number of negative matches (0-0) dominates the number