Multiple correspondence analysis - Understanding Biplots


a coordinate axis requires modification. Quantitative variables define a continuum of

values represented by continuous, often linear, axes. Categorical variables take a finite

number of levels $L_k$ represented by $L_k$ points, termed the category-level points. Just as a

sample with a value $x_k$ is closer to the marker for $x_k$ on the $k$th axis, so must a sample

with a particular category level be closest to the CLP for that category level. For the

EMC it works out that the CLPs for all L category levels are given by the rows of the

unit matrix I. For MCA the CLPs are at the points given by the rows of the matrix

$p^{-1/2} L^{-1/2}$; for CVA the CLPs are the coordinates of the canonical means.
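The nearest-CLP property can be checked numerically for the EMC case. The following is a minimal Python sketch, not the authors' code. It assumes that each sample is represented by its concatenated indicator row and that the CLPs are the rows of the unit matrix, ignoring any common scaling factor (which would not change which CLP is nearest). The data and the helper names unit_vector, indicator_row and sq_dist are hypothetical.

```python
def unit_vector(j, n):
    """Row j of the n x n unit matrix: the assumed EMC CLP for level j."""
    return [1.0 if i == j else 0.0 for i in range(n)]


def indicator_row(sample, levels):
    """Concatenated indicator coding of one sample (one 1 per variable)."""
    row = [0.0] * sum(levels)
    offset = 0
    for level, n_levels in zip(sample, levels):
        row[offset + level] = 1.0
        offset += n_levels
    return row


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Hypothetical data: p = 2 variables with L_1 = 3 and L_2 = 2 levels.
levels = [3, 2]
sample = (2, 0)  # level 2 of variable 1, level 0 of variable 2

point = indicator_row(sample, levels)
clps = [unit_vector(j, sum(levels)) for j in range(sum(levels))]
distances = [sq_dist(point, clp) for clp in clps]

# For each variable, find the level whose CLP is nearest the sample.
nearest = []
offset = 0
for n_levels in levels:
    block = distances[offset:offset + n_levels]
    nearest.append(block.index(min(block)))
    offset += n_levels

print(distances)
print(nearest)
```

Under these assumptions the squared distance from the sample to each of its own CLPs is p - 1 = 1, against p + 1 = 3 for every other CLP, so the nearest CLP within each variable recovers the sample's category level.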

Thus, the CLPs act like coordinate axes, in that every sample is placed at the vector-sum

of its associated CLPs and is nearest the p correct CLPs, one from each set, for its

particular categories. Thus every CLP defines a set of nearest-neighbour regions. This

causes some problems for representing categories in low-dimensional representations. For

continuous variables we have seen that the concept of back-projection (Section 5.4.2.3)

allows markers to be placed in the approximation that are nearest the true markers that

inhabit some high-dimensional space, which in turn justifies constructing biplot axes. For

CLPs the position is similar but there is a whole convex region in the approximation

space that is nearer each CLP than the others. These regions are called prediction regions

because any point in a prediction region will be predicted to have that same associated

category level. The $k$th categorical variable will have $L_k$ such prediction regions. This

is shown in Figures 8.9 (MCA) and 8.10 (EMC) for the four variables of Table 8.1. The

prediction regions for the two methods are similar, but this is not easy to see because the

maps are differently oriented. One problem with this representation is that each categorical

variable occupies the whole of the space of the display, not just a single biplot axis.

This makes it impracticable, though not impossible, to represent the prediction regions

for more than one variable on a single diagram. Gower and Hand (1996) and Gower

(1993) sketch how a sophisticated algorithm might be developed for calculating the prediction

regions. A simple alternative is to cover the whole approximation with pixels and

associate a different colour with each pixel, to indicate the nearest CLP; this is what we

have done in our examples. The pixel colouring procedure is implemented in our function

pred.regions that is called by MCAbipl on setting argument pred.regions = TRUE.

Note that the CLPs are not at the centroids of their nearest-neighbour regions,

so neither are their projections. Indeed, the projections need not even lie within their

prediction regions and some category levels may not have a prediction region in the

approximation space, because it is hidden behind other prediction regions.
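The pixel-colouring idea can be sketched in a few lines of Python. This is an illustration only, not the authors' pred.regions function. It assumes a two-dimensional approximation space and uses the fact that the squared distance from a pixel in that plane to a CLP in the full space is the squared in-plane distance to the CLP's projection plus the squared distance of the CLP from the plane. The function name nearest_clp_grid and all numerical values are hypothetical.

```python
def nearest_clp_grid(projections, residual_sq, xs, ys):
    """Label each pixel (x, y) with the index of its nearest CLP.

    projections: 2-D coordinates of each CLP projected onto the display plane.
    residual_sq: squared distance of each CLP from the display plane.
    """
    grid = []
    for y in ys:
        row = []
        for x in xs:
            # Full-space squared distance = in-plane part + out-of-plane part.
            d2 = [(x - px) ** 2 + (y - py) ** 2 + r
                  for (px, py), r in zip(projections, residual_sq)]
            row.append(d2.index(min(d2)))
        grid.append(row)
    return grid


# Hypothetical CLPs for one categorical variable with three levels.
projections = [(-1.0, 0.0), (1.0, 0.0), (0.2, 0.0)]
residual_sq = [0.0, 0.0, 4.0]  # the third CLP lies far from the plane

n = 21
xs = [-2.0 + 4.0 * i / (n - 1) for i in range(n)]
ys = [-2.0 + 4.0 * j / (n - 1) for j in range(n)]

grid = nearest_clp_grid(projections, residual_sq, xs, ys)
labels_present = sorted({label for row in grid for label in row})
print(labels_present)

# Render the regions as text; a plot would map each label to a colour.
for row in reversed(grid):
    print("".join(str(label) for label in row))
```

In this toy configuration the third category level (index 2) receives no pixels at all, because its CLP is too far from the plane to be nearest anywhere in it, and its projection at (0.2, 0) falls inside the region of the second level. This matches the two caveats noted above about projections and hidden prediction regions.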

8.6 Homogeneity analysis

The methods discussed in this section are often described as optimal scores methods.

Their aim is to replace the nominal category levels by numerical optimal scores. In

homogeneity analysis (see Gifi, 1990) we seek scores $z = (z_1, z_2, \ldots, z_p)$, often termed

quantifications, that replace $G$ by $Gz$. The criterion chosen is to minimize the dispersion

within the rows; hence the reference to homogeneity. This is the same as maximizing

dispersion between rows. If the scaling of z is left uncontrolled there is a trivial solution
