means in two dimensions; only orientation changes. In general, classifying using the
weighted form carries with it an assumption that future samples will occur at similar
rates to those given in N. This may not be so, and in any case one may be especially
interested in identifying members of the rarely occurring groups, so the last thing
we want to do is to give rare groups low weights. Furthermore, allocation to a group
using low-dimensional approximations is not ideal, as exact Mahalanobis distances in
p dimensions are readily computed. Apart from the two cases just discussed, we may
also choose C = I , in which case we retain the weighted centroid, which coincides with
the centroid of all n samples, but do an unweighted PCA, with minor changes from the
other unweighted case. The only effect of these variants is to give different projections,
$\mathbf{VJV}'$ derived from (4.4), or equivalently $\mathbf{MJM}^{-1}$ derived from (4.5), but the predicted fit $\hat{\mathbf{X}}$ given by (4.6) remains valid in all cases.
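Since exact Mahalanobis distances in the full p dimensions are readily computed, allocation to a group can be carried out directly. A minimal sketch in Python; the group means and the within-group covariance W below are illustrative placeholders, not values from the text:

```python
import numpy as np

# Minimal sketch: allocate a sample to the group whose mean is nearest in
# exact p-dimensional Mahalanobis distance. Means and W are illustrative.
rng = np.random.default_rng(0)
p, g = 3, 4
means = rng.normal(size=(g, p))                # one group mean per row
W = np.eye(p) + 0.1 * np.ones((p, p))          # assumed within-group covariance
W_inv = np.linalg.inv(W)

def nearest_group(x, means, W_inv):
    """Index of the group mean nearest to x in Mahalanobis distance."""
    diffs = means - x                          # (g, p)
    d2 = np.einsum('ij,jk,ik->i', diffs, W_inv, diffs)  # quadratic forms
    return int(np.argmin(d2))
```

For instance, `nearest_group(means[2], means, W_inv)` returns 2, since the distance from a group mean to itself is zero.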
Often the researcher will make a scatterplot of the first two canonical variates. We show below that this plot can be enhanced by following the biplot format:

- representing both the means and information on the original variables;
- interpolating the original samples into the biplot;
- ensuring an aspect ratio of one to allow for visual appraisal of Mahalanobis distance;
- removing the scaffolding canonical axes scales but placing markers in the original units of measurement on the biplot axes.
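The interpolation step can be sketched as follows. This is a hedged illustration on simulated data, assuming the usual two-sided eigenvalue formulation of CVA; all variable names are our own choosing:

```python
import numpy as np

# Hedged sketch (simulated data): compute the canonical transformation M from
# grouped data via the two-sided eigenvalue problem, then interpolate the
# original samples into the r-dimensional biplot space as Z = X M_r.
rng = np.random.default_rng(1)
g, per, p, r = 3, 20, 4, 2
X = np.vstack([rng.normal(loc=i, size=(per, p)) for i in range(g)])
labels = np.repeat(np.arange(g), per)

grand = X.mean(axis=0)
means = np.array([X[labels == i].mean(axis=0) for i in range(g)])
B = sum(per * np.outer(m - grand, m - grand) for m in means)      # between-groups SSP
W = sum(np.cov(X[labels == i].T) * (per - 1) for i in range(g))   # within-groups SSP

# Solve B m = lambda W m with M'WM = I, via the symmetric form W^{-1/2} B W^{-1/2}.
w_vals, w_vecs = np.linalg.eigh(W)
W_inv_half = w_vecs @ np.diag(w_vals ** -0.5) @ w_vecs.T
lam, U = np.linalg.eigh(W_inv_half @ B @ W_inv_half)
M = W_inv_half @ U[:, np.argsort(lam)[::-1]]   # columns in decreasing eigenvalue order

Z = (X - grand) @ M[:, :r]            # samples interpolated into the biplot (n x r)
Z_means = (means - grand) @ M[:, :r]  # canonical means for the scatterplot
```

Because M is normalised so that M'WM = I, ordinary distances between the rows of Z approximate Mahalanobis distances in the original space, which is what makes the aspect-ratio-one requirement above meaningful.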
We have seen that an observation in the canonical space is classified to its nearest (in the Mahalanobis sense) canonical mean. The CVA biplot is constructed in r ≤ m dimensions based on the transformation Z = XM_r, and approximate r-dimensional convex nearest-neighbour classification regions are given by

$$C_j^{[t]} = \left\{ \mathbf{z} \in \mathbb{R}^r : d_t(\mathbf{z} - \mathbf{z}_j) < d_t(\mathbf{z} - \mathbf{z}_h) \ \text{for all } h \neq j \right\}, \qquad (4.8)$$

where $d_t(\cdot)$ denotes the Pythagorean distance of its argument calculated in dimension t = r, r + 1, ..., m.
These neighbour regions are of two kinds, depending on whether $\mathbf{z}_j$ and $\mathbf{z}_h$ are in m or r dimensions. When r = m, (4.8) yields proper classification regions, but when r < m, (4.8) gives nearest-neighbour regions in the approximation space.
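A minimal sketch of both ideas, with illustrative canonical means: the nearest-neighbour assignment of (4.8) in the r = 2 approximation space, and the 95% quantile used for the confidence circles discussed next (for r = 2 the chi-squared quantile has a closed form, so no statistics library is needed):

```python
import math
import numpy as np

# Sketch of the nearest-neighbour regions of (4.8): in the r-dimensional
# approximation space a point z is assigned to the canonical mean z_j nearest
# in ordinary (Pythagorean) distance. The canonical means are illustrative.
def classify(z, canonical_means):
    """Index j minimising the Pythagorean distance from z to each canonical mean."""
    d2 = ((canonical_means - z) ** 2).sum(axis=1)
    return int(np.argmin(d2))

z_means = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])  # r = 2 canonical means
print(classify(np.array([2.5, 0.4]), z_means))            # -> 1, nearest to (3, 0)

# For r = 2 the chi-squared(2) distribution is exponential with mean 2, so the
# 0.95 quantile is -2 ln(0.05) (about 5.9915) and the confidence-circle radius
# is its square root (about 2.4478).
q95 = -2.0 * math.log(0.05)
radius = math.sqrt(q95)
```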
As well as classification regions, asymptotic confidence regions, which turn out to
be circular, may be drawn around the canonical means. The canonical variables are
uncorrelated with unit variance. If we assume that the initial, and hence also the canonical
variables, originate from a normal distribution, it follows that the sum of squares of the coordinates of a point ( z_1, z_2, ..., z_r ) in r-dimensional canonical space follows a $\chi^2_r$ distribution. Thus, we may seek, say, a 95% confidence region for which

$$P(z_1^2 + z_2^2 + \cdots + z_r^2 \leq \chi^2_{r;0.95}) = 0.95.$$

This represents an r-dimensional sphere. For example, when r = 2 we have $P(\chi^2_2 \leq 5.9915) = 0.95$, so the radius of the 95% confidence circle is $\sqrt{5.9915} = 2.4478$. We may draw a
circle of this radius around each canonical mean which then may be used to supplement
the classification regions, when assigning a sample to its nearest group mean. If one is
interested in an informal significance test for differences between the groups, the radius
for the i th group should be divided by $\sqrt{n_i}$ and then one can inspect for the degree
overlap, if any, between the circles. When r = 1 the above confidence spheres reduce