residual sum of squares; and (iv) JCA which is as for (iii) but applying the adjustment
iteratively. Each of these successively reduces the residual sum of squares until in (iv) it is
globally minimized. However, (ii) and (iii) do not give an orthogonal analysis of variance
(total = fit + residual) so 'variance accounted for' should be treated with caution. Gower
and Hand (1996) give a detailed discussion of these issues, while Gower (2006) tabulates
several measures of fit.
8.4 Similarity matrices and the extended matching coefficient
We have expressed doubts concerning the usefulness of approximating $p^{-1/2}\mathbf{G}\mathbf{L}^{-1/2}$
via chi-squared distance. Because G is a binary matrix, another possibility is to use
any dissimilarity coefficient for binary variables to determine an n × n dissimilarity
matrix and analyse it by any form of MDS. A problem with this is that, because of its
underlying p -variable structure, G is not a conventional binary matrix. What is needed is
a dissimilarity coefficient that respects this structure. Gower and Hand (1996) suggested
the extended matching coefficient (EMC) which expresses the number of matches for
every pair of samples as a ratio of the number p of variables. Thus the proportion of
matches is given by the nondiagonal values of
$$\mathbf{G}\mathbf{G}'/p \tag{8.14}$$
and the corresponding proportion of dissimilarities by
$$\mathbf{1}\mathbf{1}' - \mathbf{G}\mathbf{G}'/p. \tag{8.15}$$
When every variable has two levels, the EMC coincides with the simple matching
coefficient.
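The computation in (8.14) and (8.15) can be illustrated with a minimal Python sketch. The helper names (`indicator_matrix`, `emc_similarity`, `emc_dissimilarity`) and the four-sample data set are invented here purely for illustration; the only assumption taken from the text is that G is an indicator matrix with exactly one 1 per variable in every row.

```python
def indicator_matrix(data):
    """Build the indicator matrix G (one column per category level).

    `data` is a list of samples, each a tuple of p categorical values.
    Returns (G, p), where every row of G contains exactly p ones.
    """
    p = len(data[0])
    # Category levels observed for each variable, in sorted order.
    levels = [sorted({row[j] for row in data}) for j in range(p)]
    G = []
    for row in data:
        g = []
        for j in range(p):
            g.extend(1 if row[j] == lev else 0 for lev in levels[j])
        G.append(g)
    return G, p


def emc_similarity(G, p):
    """Extended matching coefficient: entries of GG'/p, as in (8.14)."""
    n = len(G)
    return [[sum(x * y for x, y in zip(G[i], G[k])) / p for k in range(n)]
            for i in range(n)]


def emc_dissimilarity(G, p):
    """Corresponding dissimilarities: entries of 11' - GG'/p, as in (8.15)."""
    return [[1 - s for s in row] for row in emc_similarity(G, p)]


# Hypothetical data: p = 3 categorical variables, L = 3 + 2 + 2 = 7 levels.
data = [
    ("red", "small", "yes"),
    ("red", "large", "no"),
    ("blue", "large", "no"),
    ("green", "large", "yes"),
]
G, p = indicator_matrix(data)
S = emc_similarity(G, p)
D = emc_dissimilarity(G, p)
```

In this example samples 2 and 3 agree on two of the three variables, so `S[1][2]` is 2/3, while samples 1 and 3 agree on none, so `S[0][2]` is 0. The diagonal of `S` is always 1 because each sample matches itself on all p variables, which is why only the nondiagonal values are of interest.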
It is of some interest to compare the simple properties of the EMC with what happens
if G is treated as a binary matrix, ignoring the structure of the underlying categorical
variables. To be a little more explicit about the EMC, for some pair of rows of G, let
$a$ be the number of 1-1 matches, $b$ the number of 1-0 mismatches, $c$ the number of 0-1
mismatches and $d$ the number of 0-0 matches. Thus $a + b + c + d = L$, the number of columns of G. The number of
positive matches is $a$ out of $p$ variables, giving an EMC of $E = a/p$. The relationship
with the calculation of a conventional similarity coefficient is easy to determine. Because every
row of G contains exactly $p$ ones, if there are $a$ 1-1 matches then there must be $p - a$ 1-0
mismatches and $p - a$ 0-1 mismatches (that is, $b = c = p - a$), leaving a
remainder of $d = L - 2p + a$ 0-0 matches. Thus, if we use a similarity coefficient of the
Jaccard family, $S = a/(a + \theta\{b + c\})$, it takes the special form $S = a/(a + 2\theta\{p - a\})$,
depending solely on $a$ and hence equivalently on $E$. We have that
$$E/S = (a + 2\theta\{p - a\})/p = (1 - 2\theta)E + 2\theta,$$
showing that E and S are monotonically related. A similarity coefficient from the
simple matching family, $S = (a + d)/(a + d + \theta\{b + c\})$, takes the special form
$S = (L - 2\{p - a\})/(L - 2\{1 - \theta\}\{p - a\})$. Now, we have
$$1/S = \{1 - \theta\} + \theta L/(L - 2p\{1 - E\}),$$
again showing a monotonic relationship between E and S . Nonmetric methods of analysis
are invariant to monotonically related data, so then it makes no difference what measure