a coordinate axis requires modification. Quantitative variables define a continuum of
values represented by continuous, often linear, axes. Categorical variables take a finite
number of levels $L_k$, represented by $L_k$ points termed the category-level points (CLPs). Just as a
sample with a value $x_k$ is closer to the marker for $x_k$ on the $k$th axis than to any other marker, so must a sample
with a particular category level be closest to the CLP for that category level. For the
EMC the CLPs for all $L$ category levels are given by the rows of the
unit matrix $I$. For MCA the CLPs are at the points given by the rows of the matrix
$p^{-1/2} L^{-1/2}$; for CVA the CLPs are the coordinates of the canonical means.
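As a minimal sketch of the EMC case only, the following Python fragment takes the CLPs to be the rows of a unit matrix, places each sample at the sum of the CLPs for its own category levels, and checks that, within each variable's set of CLPs, the nearest CLP is the one for the sample's own level. The data, the level names and the function names are hypothetical and chosen purely for illustration; they are not part of the authors' software.

```python
# Hypothetical data: p = 2 categorical variables with L_1 = 3 and L_2 = 2 levels.
levels = [["red", "green", "blue"], ["small", "large"]]
samples = [("red", "small"), ("blue", "large"), ("green", "small"), ("blue", "small")]


def emc_clps(levels):
    """Return {(k, level): coordinates}, the rows of the L x L unit matrix."""
    total = sum(len(lv) for lv in levels)  # L = L_1 + ... + L_p
    clps = {}
    col = 0
    for k, lv in enumerate(levels):
        for name in lv:
            e = [0.0] * total
            e[col] = 1.0
            clps[(k, name)] = e
            col += 1
    return clps


def place_sample(sample, clps):
    """Vector sum of the sample's CLPs (its row of the indicator matrix)."""
    dim = len(next(iter(clps.values())))
    point = [0.0] * dim
    for k, name in enumerate(sample):
        for i, v in enumerate(clps[(k, name)]):
            point[i] += v
    return point


def sq_dist(a, b):
    """Squared Euclidean distance between two coordinate lists."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_levels(point, levels, clps):
    """For each variable, the level whose CLP is nearest to the point."""
    return tuple(
        min(lv, key=lambda name: sq_dist(point, clps[(k, name)]))
        for k, lv in enumerate(levels)
    )


clps = emc_clps(levels)
for s in samples:
    # Each sample should recover its own category levels.
    print(s, nearest_levels(place_sample(s, clps), levels, clps))
```

In this setting a sample lies at squared distance $p - 1$ from each of its own CLPs and $p + 1$ from every other CLP, which is why the nearest CLP in each set is always the correct one.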
Thus, the CLPs act like coordinate axes, in that every sample is placed at the vector
sum of its associated CLPs and is nearest the $p$ correct CLPs, one from each set, for its
particular categories. Each CLP therefore defines a nearest-neighbour region. This
causes some problems for representing categories in low-dimensional representations. For
continuous variables we have seen that the concept of back-projection (Section 5.4.2.3)
allows markers to be placed in the approximation that are nearest the true markers that
inhabit some high-dimensional space, which in turn justifies constructing biplot axes. For
CLPs the position is similar but there is a whole convex region in the approximation
space that is nearer each CLP than the others. These regions are called prediction regions
because any point in a prediction region will be predicted to have that same associated
category level. The $k$th categorical variable will have $L_k$ such prediction regions. This
is shown in Figures 8.9 (MCA) and 8.10 (EMC) for the four variables of Table 8.1. The
prediction regions for the two methods are similar, but this is not easy to see because the
maps are differently oriented. One problem with this representation is that each categorical
variable occupies the whole of the space of the display, not just a single biplot axis.
This makes it impracticable, though not impossible, to represent the prediction regions
for more than one variable on a single diagram. Gower and Hand (1996) and Gower
(1993) sketch how a sophisticated algorithm might be developed for calculating the prediction
regions. A simple alternative is to cover the whole approximation space with pixels and
to colour each pixel according to its nearest CLP; this is what we
have done in our examples. The pixel colouring procedure is implemented in our function
pred.regions, which is called by MCAbipl on setting the argument pred.regions = TRUE.
Note that the CLPs are not at the centroids of their nearest-neighbour regions,
so neither are their projections. Indeed, the projections need not even lie within their
prediction regions, and some category levels may not have a prediction region in the
approximation space because it is hidden behind other prediction regions.
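The pixel colouring idea can be sketched in a few lines of Python. This is not the pred.regions function itself, only an illustration under stated assumptions: each CLP is described by its projection (x, y) in a two-dimensional approximation space together with h2, its squared distance from that plane, so that the squared distance from a pixel to the true CLP is the in-plane squared distance plus h2. The coordinates, the function name pixel_regions and the grid size are all hypothetical.

```python
def pixel_regions(clps, xlim, ylim, nx, ny):
    """Label each pixel centre with the index of its nearest CLP.

    clps: list of (x, y, h2), where (x, y) is the CLP's projection and h2 is
    its squared distance from the approximation plane (an assumption here).
    Returns a list of ny rows, each holding nx integer labels.
    """
    dx = (xlim[1] - xlim[0]) / nx
    dy = (ylim[1] - ylim[0]) / ny
    grid = []
    for row in range(ny):
        y = ylim[0] + (row + 0.5) * dy
        labels = []
        for col in range(nx):
            x = xlim[0] + (col + 0.5) * dx
            # Squared distance in the full space = in-plane part + h2.
            labels.append(min(
                range(len(clps)),
                key=lambda j: (x - clps[j][0]) ** 2 + (y - clps[j][1]) ** 2 + clps[j][2],
            ))
        grid.append(labels)
    return grid


# Hypothetical CLPs: two lie in the plane, the third lies far from it.
clps_2d = [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 0.2, 9.0)]
grid = pixel_regions(clps_2d, (-2.0, 2.0), (-2.0, 2.0), 40, 40)

# Category levels that own at least one pixel in the displayed window.
visible = {label for row in grid for label in row}
print(sorted(visible))
```

With these illustrative coordinates the third CLP owns no pixel in the window, which is one way a category level can end up without a prediction region in the approximation space. Each label would be mapped to a colour when drawing the display.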
8.6 Homogeneity analysis
The methods discussed in this section are often described as optimal scores methods.
Their aim is to replace the nominal category levels by numerical optimal scores. In
homogeneity analysis (see Gifi, 1990) we seek scores $z = (z_1, z_2, \ldots, z_p)$, often termed
quantifications, that replace $G$ by $Gz$. The criterion chosen is to minimize the dispersion
within the rows; hence the reference to homogeneity. This is the same as maximizing
dispersion between rows. If the scaling of z is left uncontrolled there is a trivial solution