means in two dimensions; only orientation changes. In general, classifying using the
weighted form carries with it an assumption that future samples will occur at similar
rates to those given in N. This may not be so, and in any case one may be especially
interested in identifying members of the rarely occurring groups, so the last thing
we want to do is to give rare groups low weights. Furthermore, allocation to a group
using low-dimensional approximations is not ideal, as exact Mahalanobis distances in
p dimensions are readily computed. Apart from the two cases just discussed, we may
also choose C = I , in which case we retain the weighted centroid, which coincides with
the centroid of all n samples, but do an unweighted PCA, with minor changes from the
other unweighted case. The only effect of these variants is to give different projections,
$\mathbf{VJV}'$ derived from (4.4), or equivalently $\mathbf{MJM}^{-1}$ derived from (4.5), but the predicted fit $\hat{\mathbf{X}}$ given by (4.6) remains valid in all cases.
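Since exact Mahalanobis distances in the full p dimensions are readily computed, allocation to a group can be carried out directly. A minimal sketch in Python; the group means and the within-group covariance W below are illustrative placeholders, not values from the text:

```python
import numpy as np

# Minimal sketch: allocate a sample to the group whose mean is nearest in
# exact p-dimensional Mahalanobis distance. Means and W are illustrative.
rng = np.random.default_rng(0)
p, g = 3, 4
means = rng.normal(size=(g, p))                # one group mean per row
W = np.eye(p) + 0.1 * np.ones((p, p))          # assumed within-group covariance
W_inv = np.linalg.inv(W)

def nearest_group(x, means, W_inv):
    """Index of the group mean nearest to x in Mahalanobis distance."""
    diffs = means - x                          # (g, p)
    d2 = np.einsum('ij,jk,ik->i', diffs, W_inv, diffs)  # quadratic forms
    return int(np.argmin(d2))
```

For instance, `nearest_group(means[2], means, W_inv)` returns 2, since the distance from a group mean to itself is zero.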
Often the researcher will make a scatterplot of the first two canonical variates. We show below that this plot can be enhanced by following the biplot format:

- representing both the means and information on the original variables;
- interpolating the original samples into the biplot;
- ensuring an aspect ratio of one to allow for visual appraisal of Mahalanobis distance;
- removing the scaffolding canonical axes scales but placing markers in the original units of measurement on the biplot axes.
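The interpolation step can be sketched as follows. This is a hedged illustration on simulated data, assuming the usual two-sided eigenvalue formulation of CVA; all variable names are our own choosing:

```python
import numpy as np

# Hedged sketch (simulated data): compute the canonical transformation M from
# grouped data via the two-sided eigenvalue problem, then interpolate the
# original samples into the r-dimensional biplot space as Z = X M_r.
rng = np.random.default_rng(1)
g, per, p, r = 3, 20, 4, 2
X = np.vstack([rng.normal(loc=i, size=(per, p)) for i in range(g)])
labels = np.repeat(np.arange(g), per)

grand = X.mean(axis=0)
means = np.array([X[labels == i].mean(axis=0) for i in range(g)])
B = sum(per * np.outer(m - grand, m - grand) for m in means)      # between-groups SSP
W = sum(np.cov(X[labels == i].T) * (per - 1) for i in range(g))   # within-groups SSP

# Solve B m = lambda W m with M'WM = I, via the symmetric form W^{-1/2} B W^{-1/2}.
w_vals, w_vecs = np.linalg.eigh(W)
W_inv_half = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T
lam, U = np.linalg.eigh(W_inv_half @ B @ W_inv_half)
M = W_inv_half @ U[:, np.argsort(lam)[::-1]]   # columns in decreasing eigenvalue order

Z = (X - grand) @ M[:, :r]            # samples interpolated into the biplot (n x r)
Z_means = (means - grand) @ M[:, :r]  # canonical means for the scatterplot
```

Because M is normalised so that M'WM = I, ordinary distances between the rows of Z approximate Mahalanobis distances in the original space, which is what makes the aspect-ratio-one requirement above meaningful.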
We have seen that an observation in the canonical space is classified to its nearest (in the Mahalanobis sense) canonical mean. The CVA biplot is constructed in r ≤ m dimensions based on the transformation Z = XM_r, and approximate r-dimensional convex nearest-neighbour classification regions are given by

$$C_j^{[t]} = \left\{ \mathbf{z} \in \mathbb{R}^r : d_t(\mathbf{z} - \mathbf{z}_j) < d_t(\mathbf{z} - \mathbf{z}_h) \ \text{for all } h \neq j \right\}, \qquad (4.8)$$

where $d_t(\cdot)$ denotes the Pythagorean distance of its argument calculated in dimension t = r, r + 1, ..., m.
These neighbour regions are of two kinds, depending on whether $\mathbf{z}_j$ and $\mathbf{z}_h$ are in m or r dimensions. When r = m, (4.8) yields proper classification regions, but when r < m, (4.8) gives nearest-neighbour regions in the approximation space.
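A minimal sketch of both ideas, with illustrative canonical means: the nearest-neighbour assignment of (4.8) in the r = 2 approximation space, and the 95% quantile used for the confidence circles discussed next (for r = 2 the chi-squared quantile has a closed form, so no statistics library is needed):

```python
import math
import numpy as np

# Sketch of the nearest-neighbour regions of (4.8): in the r-dimensional
# approximation space a point z is assigned to the canonical mean z_j nearest
# in ordinary (Pythagorean) distance. The canonical means are illustrative.
def classify(z, canonical_means):
    """Index j minimising the Pythagorean distance from z to each canonical mean."""
    d2 = ((canonical_means - z) ** 2).sum(axis=1)
    return int(np.argmin(d2))

z_means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # r = 2 canonical means
print(classify(np.array([2.5, 0.4]), z_means))            # -> 1, nearest to (3, 0)

# For r = 2 the chi-squared(2) distribution is exponential with mean 2, so the
# 0.95 quantile is -2 ln(0.05) (about 5.9915) and the confidence-circle radius
# is its square root (about 2.4478).
q95 = -2.0 * math.log(0.05)
radius = math.sqrt(q95)
```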
As well as classification regions, asymptotic confidence regions, which turn out to
be circular, may be drawn around the canonical means. The canonical variables are
uncorrelated with unit variance. If we assume that the initial, and hence also the canonical
variables, originate from a normal distribution, it follows that the sum of squares of the coordinates of a point ( z_1, z_2, ..., z_r ) in r-dimensional canonical space follows a $\chi^2_r$ distribution. Thus, we may seek, say, a 95% confidence region for which

$$P(z_1^2 + z_2^2 + \cdots + z_r^2 \leq \chi^2_{r;0.95}) = 0.95.$$

This represents an r-dimensional sphere. For example, when r = 2 we have $P(\chi^2_2 \leq 5.9915) = 0.95$, so the radius of the 95% confidence circle is $\sqrt{5.9915} = 2.4478$. We may draw a
circle of this radius around each canonical mean which then may be used to supplement
the classification regions, when assigning a sample to its nearest group mean. If one is
interested in an informal significance test for differences between the groups, the radius
for the i th group should be divided by $\sqrt{n_i}$ and then one can inspect for the degree
overlap, if any, between the circles. When r = 1 the above confidence spheres reduce