sample with values given in Table 4.3 is also added. Judging by Figure 4.4, it is clear
that the wood sample of unknown origin should be classified as Opor. The piece of
furniture is therefore not made of stinkwood, but most probably of imported imbuia.
Using the biplot as a graphical display in the classification process showed the amount
of separation obtained between the species. Furthermore, the Pythagorean distances from
the new sample to the class means are Mahalanobis distances that can be used as a
measure of certainty associated with the classification. Had the new sample point fallen just
inside the Opor classification region, almost on the border with Obul, it could be
argued that it is not clear whether the piece of furniture is made of expensive stinkwood
or of the less valuable imbuia. Since the observation lies well inside the Opor classification
region, it can be assumed with some confidence that the wood type is indeed imbuia and
not stinkwood.
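The role of the distances as a measure of certainty can be sketched numerically. The following is a minimal illustration only: the class means, the within-class matrix `W`, the new sample `x_new` and the label `Other` for the third species are all hypothetical values chosen for the sketch, not the data of Table 4.3. It computes the Mahalanobis distance from the new sample to each class mean, assigns the nearest class, and reports the margin over the runner-up as a rough indication of how far the point lies from the boundary.

```python
import numpy as np

def mahalanobis_to_means(x_new, means, W):
    """Return the Mahalanobis distance from x_new to each row of `means`."""
    W_inv = np.linalg.inv(W)
    diffs = means - x_new                      # one row per class mean
    # Quadratic form d' W^{-1} d evaluated row by row.
    d2 = np.einsum("ij,jk,ik->i", diffs, W_inv, diffs)
    return np.sqrt(d2)

# Hypothetical class means (rows) for three species and three variables.
labels = ["Obul", "Opor", "Other"]
means = np.array([[100.0, 130.0, 10.0],
                  [120.0, 100.0, 12.0],
                  [140.0, 180.0, 8.0]])
W = np.diag([100.0, 400.0, 4.0])               # hypothetical within-class matrix
x_new = np.array([118.0, 105.0, 11.5])         # hypothetical new sample

dist = mahalanobis_to_means(x_new, means, W)
order = np.argsort(dist)
nearest, runner_up = labels[order[0]], labels[order[1]]
margin = dist[order[1]] - dist[order[0]]       # large margin: far from the border
print(nearest, runner_up, round(float(margin), 3))
```

A small margin would correspond to a point close to the border between two classification regions, the situation in which the classification is doubtful.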
In this example, we have mentioned that the means are positioned exactly, together
with exact classification regions; the individual samples are approximations obtained by
projection from their exact three-dimensional positions. The means are represented
exactly only because three points can always be placed exactly in two dimensions. If we had had more
than three species then a two-dimensional representation of the means would also be
a projected approximation and the classification regions too would be approximations.
This is treated more fully below.
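The dimensionality argument can be checked numerically. The sketch below assumes randomly generated class means in five variables (hypothetical values, used only to illustrate the geometry): once the common centroid is removed, K means span at most K - 1 dimensions, so three means fit exactly in a plane while four generally do not.

```python
import numpy as np

rng = np.random.default_rng(0)
p = 5                                        # hypothetical number of variables
dims = {}
for K in (3, 4):
    means = rng.normal(size=(K, p))          # hypothetical class means
    centred = means - means.mean(axis=0)     # remove the common centroid
    # Number of dimensions needed to display the K means exactly.
    dims[K] = int(np.linalg.matrix_rank(centred))
    print(K, dims[K])
```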
4.2 Understanding CVA and constructing its biplot
Consider the data matrix $\mathbf{X}: n \times p$, centred such that $\mathbf{1}'\mathbf{X} = \mathbf{0}'$. The data contained in $\mathbf{X}$
consist of $p$ measurements made for each of $K$ classes. The class sizes are $n_1, n_2, \ldots, n_K$,
respectively, such that $\sum_{k=1}^{K} n_k = n$. Let $\mathbf{N} = \operatorname{diag}(n_1, n_2, \ldots, n_K)$. Then the matrix of
group means can be calculated as

$$\bar{\mathbf{X}}: K \times p = \mathbf{N}^{-1}\mathbf{G}'\mathbf{X} = (\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\mathbf{X},$$

where $\mathbf{G}: n \times K$ denotes an indicator matrix defining membership of the $K$ classes.
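As an illustration of the indicator-matrix formula for the group means, here is a minimal NumPy sketch. The data matrix and the class labels are made up for the example; only the formula itself comes from the text.

```python
import numpy as np

# Hypothetical class label (0, 1 or 2) for each of n = 6 samples.
labels = np.array([0, 0, 1, 1, 1, 2])
X = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0],
              [7.0, 8.0],
              [9.0, 10.0],
              [11.0, 12.0]])
X = X - X.mean(axis=0)                  # centre the columns so that 1'X = 0'

K = labels.max() + 1
G = np.eye(K)[labels]                   # n x K indicator matrix
N = G.T @ G                             # equals diag(n_1, ..., n_K)
X_bar = np.linalg.solve(N, G.T @ X)     # (G'G)^{-1} G'X, the K x p matrix of means
print(np.diag(N))
print(X_bar)
```

Each row of `X_bar` agrees with the ordinary column means of the samples belonging to that class, which is a convenient check on `G`.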
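The sums-of-squares-and-products (SSP) matrix of $\mathbf{X}$ can be partitioned into a within-class
SSP matrix and a between-class SSP matrix such that $\mathbf{T} = \mathbf{W} + \mathbf{B}$ (Total = Within + Between), where

$$\mathbf{W} = \mathbf{X}'\mathbf{X} - \bar{\mathbf{X}}'\mathbf{N}\bar{\mathbf{X}} = \mathbf{X}'[\mathbf{I} - \mathbf{G}(\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}']\mathbf{X} \tag{4.1}$$

and

$$\mathbf{B} = \bar{\mathbf{X}}'\mathbf{N}\bar{\mathbf{X}} = \mathbf{X}'\mathbf{G}(\mathbf{G}'\mathbf{G})^{-1}\mathbf{G}'\mathbf{X}. \tag{4.2}$$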
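The partition in (4.1) and (4.2) can be verified numerically. The sketch below uses randomly generated data with hypothetical class sizes 4, 5 and 6; it forms both expressions for the between-class matrix and checks that the within-class and between-class matrices sum to the total.

```python
import numpy as np

rng = np.random.default_rng(1)
labels = np.repeat([0, 1, 2], [4, 5, 6])     # hypothetical class sizes
X = rng.normal(size=(labels.size, 3))        # hypothetical data, n = 15, p = 3
X = X - X.mean(axis=0)                       # centre the columns

G = np.eye(3)[labels]                        # n x K indicator matrix
N = G.T @ G                                  # diag(n_1, ..., n_K)
P = G @ np.linalg.inv(N) @ G.T               # G (G'G)^{-1} G'

T = X.T @ X                                  # total SSP matrix
W = X.T @ (np.eye(len(X)) - P) @ X           # within-class SSP, right side of (4.1)
B = X.T @ P @ X                              # between-class SSP, right side of (4.2)

X_bar = np.linalg.solve(N, G.T @ X)          # matrix of group means
B_from_means = X_bar.T @ N @ X_bar           # middle expression of (4.2)

print(np.allclose(T, W + B), np.allclose(B, B_from_means))
```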
The crucial thing is to find a transformation of the variables such that the Pythagorean
distances between the group means of the transformed variables are Mahalanobis distances.
Writing $\bar{\mathbf{x}}_k$ and $\bar{\mathbf{x}}_h$ for the means of the $k$th and $h$th groups, respectively, then the
Mahalanobis distance $\delta_{kh}$ between the two group means is given by

$$\delta_{kh}^2 = (\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_h)'\mathbf{W}^{-1}(\bar{\mathbf{x}}_k - \bar{\mathbf{x}}_h),$$

so we are looking for a nonsingular transformation matrix $\mathbf{L}: p \times p$ such that $\mathbf{x}'\mathbf{L}\mathbf{L}'\mathbf{x} = \mathbf{x}'\mathbf{W}^{-1}\mathbf{x}$ or $\mathbf{L}\mathbf{L}' = \mathbf{W}^{-1}$. Consider the eigenvector equation
$$\mathbf{W}\mathbf{L} = \mathbf{L}\boldsymbol{\Lambda}, \tag{4.3}$$
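One way to obtain a matrix satisfying the requirement that L L' equals the inverse of W is sketched below. This is an assumption-laden illustration, not necessarily the construction the text goes on to use: it assumes that the matrix on the right of the eigenvector equation is the diagonal matrix of eigenvalues of W, and it scales the orthonormal eigenvectors of W by the inverse square roots of those eigenvalues. The matrix `W` and the two mean vectors are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.normal(size=(20, 3))
W = A.T @ A                                  # hypothetical positive definite W

lam, V = np.linalg.eigh(W)                   # W V = V diag(lam), V orthogonal
L = V / np.sqrt(lam)                         # scale column j by lam_j^{-1/2}

x_k = np.array([1.0, 2.0, 3.0])              # hypothetical group means
x_h = np.array([0.5, -1.0, 2.0])
d = x_k - x_h

mahal = np.sqrt(d @ np.linalg.solve(W, d))   # Mahalanobis distance in the original variables
pythag = np.linalg.norm(L.T @ x_k - L.T @ x_h)  # Pythagorean distance after transforming by L

print(np.allclose(L @ L.T, np.linalg.inv(W)), np.isclose(mahal, pythag))
```

With this scaling the columns of `L` remain eigenvectors of `W`, so the eigenvector equation still holds, and the Pythagorean distance between the transformed means reproduces the Mahalanobis distance.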