the reader is warned that it is not always easy to decide what variant is being offered
in the wide range of available software. We hope that after reading this chapter readers
will be in a position to decide which CA method and biplot representation best suit
their purposes.
Before discussing CA biplots, we note that a data set may also consist of the
different category levels of a single categorical variable. These category levels may
also be regarded as the values of a dependent or response variable depending on a
single independent categorical variable. This situation calls for an optimal scoring
procedure where the categories are replaced by optimal scores. Optimal scoring will be
discussed in the latter part of Chapter 8. The latter situation can also be handled by
multiple correspondence analysis (MCA) with three variables - that labelling the rows,
that labelling the columns, and the categorical variable in the body of the table - but
then the dependency relationship is ignored. MCA is discussed in the first part of
Chapter 8.
7.2 The correspondence analysis biplot
Correspondence analysis (Benzécri, 1973; Greenacre, 2007) analyses the association in
a p × q two-way contingency table X. Just as the biadditive model is concerned with
deviations from main effects, so CA is concerned with deviations from independence.
Although CA is ideally performed on a contingency table, computationally, the elements
of X may contain any nonnegative values; indeed only the row and column totals need
be positive. Notice also that we are assuming a specific model for CA, namely, the
independence model.
We begin with some notation. Let R and C denote the diagonal matrices containing
the row and column sums $X\mathbf{1}$ and $\mathbf{1}'X$, respectively, of X, treated as diagonal elements.
The total sum of X is denoted by $n = \mathbf{1}'X\mathbf{1} = \mathbf{1}'R\mathbf{1} = \mathbf{1}'C\mathbf{1}$. With this notation, the
independence model, stating that the row classification of X is independent of its
column classification, is given by $E = R\mathbf{1}\mathbf{1}'C/n$. The CA procedure may be expressed in
a variety of closely related variants. Greenacre (1984, 2007) and Le Roux and Rouanet
(2004), in common with others, work in terms of frequencies and therefore initially
divide X by n; this has no material effect and is ignored in the following. Here we
cannot give an exhaustive treatment, but content ourselves with discussing some of
the main variants of CA, especially with regard to biplot interpretation. Central to
all these variants is the approximation to the deviations X - E from the independence
model:
$$X - E = X - R\mathbf{1}\mathbf{1}'C/n. \tag{7.1}$$
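As a rough numerical illustration of the independence model and the deviations in (7.1), the following minimal sketch computes the row and column sums, the total n, the matrix E and the deviations X - E for a small table. The function name `independence_deviations` and the 2 × 3 table of counts are hypothetical, invented here purely for illustration; only the Python standard library is assumed.

```python
def independence_deviations(X):
    """Return (E, D): the independence-model fit E and the deviations D = X - E.

    X is a p x q table given as a list of rows of nonnegative numbers.
    (Hypothetical helper, written for illustration only.)
    """
    if any(x < 0 for row in X for x in row):
        raise ValueError("entries of X must be nonnegative")
    row_sums = [sum(row) for row in X]          # diagonal elements of R
    col_sums = [sum(col) for col in zip(*X)]    # diagonal elements of C
    if min(row_sums) <= 0 or min(col_sums) <= 0:
        raise ValueError("every row and column total must be positive")
    n = sum(row_sums)                           # total sum, n = 1'X1
    # The (i, j) element of E = R 1 1' C / n is r_i * c_j / n.
    E = [[r * c / n for c in col_sums] for r in row_sums]
    # Deviations from independence, element by element.
    D = [[x - e for x, e in zip(x_row, e_row)] for x_row, e_row in zip(X, E)]
    return E, D


# Hypothetical 2 x 3 table of counts, invented for illustration.
X = [[10, 20, 30],
     [20, 10, 10]]
E, D = independence_deviations(X)
```

Because E has the same row and column totals as X, the deviations in `D` sum to zero along every row and every column; checking this is a quick sanity test on any implementation.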
7.2.1 Approximation to Pearson's chi-squared
A simple possibility is to base biplots directly on the singular value decomposition of
the deviations X - E from the independence model, but it turns out that the weighted
deviations
$$R^{-1/2}(X - E)C^{-1/2} \tag{7.2}$$