similarly to the prediction biplot trajectories of the nonlinear biplot described in Chapter 5.
Since this example uses Pythagorean distance, all three prediction methods produce the
same linear biplot trajectory shown in Figure 9.8.
9.8 An example
We now return to the remuneration data introduced in Section 4.9.1 and also used in
Section 8.10 to illustrate a categorical PCA biplot where the continuous variables were
categorized. In a generalized biplot, the distinction between quantitative and qualitative
variables is retained. The same variables described in Section 8.10 are used in this
illustration, but now Remun, Resrch, Age and AQual are treated quantitatively. We use
Pythagorean distance for the continuous variables and the ECM for the categorical variables.
As usual (see, for example, Section 9.2) we have to use some form of scaling for
the quantitative variables. Furthermore, because the qualitative variables differ in their
number of categories, they too have to be normalized. The usual way to normalize
the quantitative variables is to centre and then scale each to unit sum of squares. Thus
we normalized each of Remun, Resrch, Age and AQual to unit sum of squares. Since
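$\mathbf{1}'\mathbf{D}_q\mathbf{1}$ for each quantitative variable is equal to $-n$ times the corrected sum of squares
for that variable (the ddistances being $-\tfrac{1}{2}d_{ij}^2$), this normalization process is equivalent to dividing the ddistances $\mathbf{D}_q$
by $-(\mathbf{1}'\mathbf{D}_q\mathbf{1})/n$. Similarly, with the ECM, each qualitative variable was scaled so that
$\mathbf{1}'\mathbf{D}_k\mathbf{1} = -n$, where $\mathbf{D}_k = -\tfrac{1}{2}(\mathbf{1}_n\mathbf{1}_n' - \mathbf{G}_k\mathbf{G}_k')$. This type of normalization balances the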
contributions of quantitative and qualitative variables to overall distance, but it is not the
only possibility.
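The normalization just described can be checked numerically. The following is a minimal sketch in plain Python, not the authors' implementation: the function names and the toy values are hypothetical, and it assumes ddistances of the form -1/2 d^2, with the ECM giving squared distance 1 between samples in different categories and 0 otherwise.

```python
import math

def ddistance_quant(x):
    """Ddistances -1/2 (x_i - x_j)^2 for one quantitative variable."""
    return [[-0.5 * (a - b) ** 2 for b in x] for a in x]

def ddistance_ecm(labels):
    """Ddistances for one categorical variable under the ECM:
    0 if two samples share a category, -1/2 otherwise,
    i.e. the entries of -1/2 (1 1' - G G')."""
    return [[0.0 if a == b else -0.5 for b in labels] for a in labels]

def total(D):
    """1' D 1: the sum of all entries of D."""
    return sum(sum(row) for row in D)

def normalise(D):
    """Divide D by -(1' D 1)/n so that the result sums to -n."""
    n = len(D)
    scale = -total(D) / n  # positive unless the variable is constant
    if scale <= 0:
        raise ValueError("variable is constant; cannot normalise")
    return [[d / scale for d in row] for row in D]

def corrected_ss(x):
    """Corrected sum of squares: sum of squared deviations from the mean."""
    mean = sum(x) / len(x)
    return sum((v - mean) ** 2 for v in x)

def unit_ss(x):
    """Centre x and scale it to unit sum of squares."""
    mean = sum(x) / len(x)
    root = math.sqrt(corrected_ss(x))
    return [(v - mean) / root for v in x]

# Hypothetical toy data, not the remuneration data.
x_toy = [1.0, 2.0, 4.0, 7.0]
labels_toy = ["a", "b", "a", "c"]

D_q = ddistance_quant(x_toy)
D_k = ddistance_ecm(labels_toy)

# 1' D_q 1 equals -n times the corrected sum of squares.
print(total(D_q), -len(x_toy) * corrected_ss(x_toy))
# After normalisation both kinds of variable sum to -n.
print(total(normalise(D_q)), total(normalise(D_k)))
```

In this sketch, normalise(D_q) agrees entry by entry with ddistance_quant(unit_ss(x_toy)), which is the sense in which dividing the ddistances is equivalent to scaling the variable to unit sum of squares.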
It is clear from Figure 9.9 that prediction regions for all categories of a qualitative
variable are not necessarily represented in the biplot space: for academic position, prediction
regions for lecturer (R2), senior lecturer (R3) and full professor (R5) are visible
but not those for junior lecturer and associate professor, while prediction regions for only
four of the nine faculties appear in the biplot space. The individual sample points are
printed as solid squares. We have coloured (in the top left biplot) the squares according
to gender: red squares denoting the females and green squares the males. However, the
output of our function Genbipl provides all the necessary information for easily obtaining
biplots with a different colouring scheme, for example by faculty or
by academic position.
That the sample points, obtained by projection, do not necessarily fall within their
corresponding prediction regions, obtained by back-projection, is clear from Figure 9.9.
We could show this by colouring the category levels in Figure 9.9, but this would interfere
with the coloured prediction regions. Therefore we show the information numerically in
Table 9.2, where the entries in bold give the numbers of correct predictions while the plain
entries show the numbers of incorrect predictions. The proportion of correct predictions
is closely analogous to the predictivity measure (Section 3.3) for quantitative variables.
However, with categorical variables we have additional information giving the separate
contributions to predictivity of each category level. Thus in Table 9.2(a) R1 and R4
are never predicted, R2 and R5 are well predicted, while R3 is poorly predicted. In
Table 9.2(b) both genders are well predicted. Table 9.2(c) never predicts F3, F4, F6, F7
and F8, while F1, F2, F5 and F9 are all rather poorly predicted.
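A table of correct and incorrect predictions of this kind can be tabulated as follows. This is a rough sketch only: the observed and predicted labels are invented for illustration (they are not the entries of Table 9.2), the function names are hypothetical, and placing observed levels in rows and predicted levels in columns is an assumption about layout.

```python
from collections import Counter

def prediction_table(observed, predicted):
    """Cross-tabulate observed (rows) against predicted (columns) levels."""
    counts = Counter(zip(observed, predicted))
    levels = sorted(set(observed) | set(predicted))
    table = [[counts[(o, p)] for p in levels] for o in levels]
    return levels, table

def proportion_correct(table):
    """Overall share of samples on the diagonal (correct predictions)."""
    n = sum(sum(row) for row in table)
    return sum(table[i][i] for i in range(len(table))) / n

def per_level_correct(levels, table):
    """Share of each observed level that is predicted correctly."""
    shares = {}
    for i, level in enumerate(levels):
        row_total = sum(table[i])
        shares[level] = table[i][i] / row_total if row_total else None
    return shares

def never_predicted(levels, table):
    """Levels whose column total is zero, i.e. never predicted for any sample."""
    return [level for j, level in enumerate(levels)
            if sum(row[j] for row in table) == 0]

# Invented labels for illustration only.
observed = ["R2", "R2", "R3", "R3", "R5", "R1"]
predicted = ["R2", "R2", "R2", "R3", "R5", "R2"]

levels, table = prediction_table(observed, predicted)
print(levels)
for level, row in zip(levels, table):
    print(level, row)
print(proportion_correct(table))
print(per_level_correct(levels, table))
print(never_predicted(levels, table))
```

The diagonal counts play the role of the bold entries, and the per-level shares give the separate contribution of each category level that a single overall proportion hides.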