Table 8.1
A simple data matrix for information on five categorical variables for seven
individuals.

Case         Sex   Hair Colour   Region     Work           Education
1 George     M     Brown         England    Manual         School
2 Alisdair   M     Dark          Scotland   Clerical       University
3 Jane       F     Brown         Scotland   Professional   University
4 Ivor       M     Grey          Wales      Professional   University
5 Myfanwy    F     Fair          Wales      Clerical       School
6 Harriet    F     Brown         England    Manual         School
7 Jeremy     M     Grey          England    Professional   Postgrad
Table 8.2
Recoding of Table 8.1 as an indicator matrix $\mathbf{G}$. Here $\mathbf{G}_1$ has two levels
(M, F), $\mathbf{G}_2$ has four levels (B, D, F, G), $\mathbf{G}_3$ has three levels (E, S, W),
$\mathbf{G}_4$ has three levels (M, C, P) and $\mathbf{G}_5$ has three levels (S, U, P). The
frequencies $\mathbf{1}'\mathbf{L}_1$, $\mathbf{1}'\mathbf{L}_2$, $\mathbf{1}'\mathbf{L}_3$, $\mathbf{1}'\mathbf{L}_4$ and $\mathbf{1}'\mathbf{L}_5$ are given in the final row.

              Sex    Hair Colour   Region   Work     Education
Case          M F    B D F G       E S W    M C P    S U P
1 George      1 0    1 0 0 0       1 0 0    1 0 0    1 0 0
2 Alisdair    1 0    0 1 0 0       0 1 0    0 1 0    0 1 0
3 Jane        0 1    1 0 0 0       0 1 0    0 0 1    0 1 0
4 Ivor        1 0    0 0 0 1       0 0 1    0 0 1    0 1 0
5 Myfanwy     0 1    0 0 1 0       0 0 1    0 1 0    1 0 0
6 Harriet     0 1    1 0 0 0       1 0 0    1 0 0    1 0 0
7 Jeremy      1 0    0 0 0 1       1 0 0    0 0 1    0 0 1
Frequencies   4 3    3 1 1 2       3 2 2    2 2 3    3 3 1
indicator matrices for all categorical variables to give
$\mathbf{G} = [\mathbf{G}_1, \mathbf{G}_2, \mathbf{G}_3, \ldots, \mathbf{G}_p] : n \times L$, where $L = L_1 + L_2 + \cdots + L_p$.
Table 8.2 shows Table 8.1 coded as an indicator matrix. Thus $\mathbf{G}$, consisting entirely
of 0s and 1s, is the categorical equivalent of the quantitative data matrix $\mathbf{X}$ of PCA.
Because every categorical variable has one level for every sample, we have that the rows
of $\mathbf{G}$ all sum to $p$. Further, the column sums give the frequencies of all the category
levels, assumed to be held in an $L \times L$ diagonal matrix
$\mathbf{L} = \mathrm{diag}(\mathrm{diag}(\mathbf{L}_1), \mathrm{diag}(\mathbf{L}_2), \ldots, \mathrm{diag}(\mathbf{L}_p))$.
Hence, $\mathbf{G}\mathbf{1} = p\mathbf{1}$, $\mathbf{1}'\mathbf{G} = \mathbf{1}'\mathbf{L}$ and $\mathbf{1}'\mathbf{L}\mathbf{1} = np$.
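The properties above can be checked numerically on the data of Table 8.1. The following is a minimal sketch in Python with NumPy, not code from the text: the names `data`, `levels`, `indicator_block` and `L_mat` are hypothetical and chosen only for illustration, and the level order is assumed to follow the column order of Table 8.2.

```python
import numpy as np

# Table 8.1: one tuple per case (Sex, Hair Colour, Region, Work, Education).
data = [
    ("M", "Brown", "England",  "Manual",       "School"),      # 1 George
    ("M", "Dark",  "Scotland", "Clerical",     "University"),  # 2 Alisdair
    ("F", "Brown", "Scotland", "Professional", "University"),  # 3 Jane
    ("M", "Grey",  "Wales",    "Professional", "University"),  # 4 Ivor
    ("F", "Fair",  "Wales",    "Clerical",     "School"),      # 5 Myfanwy
    ("F", "Brown", "England",  "Manual",       "School"),      # 6 Harriet
    ("M", "Grey",  "England",  "Professional", "Postgrad"),    # 7 Jeremy
]

# Assumed level order for each variable, matching the columns of Table 8.2.
levels = [
    ["M", "F"],
    ["Brown", "Dark", "Fair", "Grey"],
    ["England", "Scotland", "Wales"],
    ["Manual", "Clerical", "Professional"],
    ["School", "University", "Postgrad"],
]


def indicator_block(values, level_order):
    """Return the n x L_k indicator matrix G_k of one categorical variable."""
    return np.array([[int(v == lev) for lev in level_order] for v in values])


# Concatenate the p blocks side by side: G = [G_1, ..., G_p], of size n x L.
blocks = [
    indicator_block([row[k] for row in data], levels[k])
    for k in range(len(levels))
]
G = np.hstack(blocks)

n, L_total = G.shape      # n cases, L = L_1 + ... + L_p columns
p = len(levels)           # number of categorical variables
freq = G.sum(axis=0)      # column sums 1'G: the category-level frequencies
L_mat = np.diag(freq)     # the L x L diagonal matrix L

# G1 = p1: every row sums to p.
assert (G.sum(axis=1) == p).all()
# 1'G = 1'L: the column sums are the diagonal of L.
assert np.array_equal(np.ones(n) @ G, np.ones(L_total) @ L_mat)
# 1'L1 = np: the grand total of G.
assert L_mat.sum() == n * p

print(G)
print("Frequencies:", freq)
```

Building each block separately and then stacking them mirrors the definition of $\mathbf{G}$ as a concatenation of the per-variable indicator matrices; the printed frequencies can be compared against the final row of Table 8.2.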
8.2 Multiple correspondence analysis of the indicator matrix
One way of generalizing CA is to treat the categorical data matrix
G
as if it were a
two-way contingency table. This compares with the CA of chi-squared distance where
we saw in Chapter 7 that the two-way contingency table is sometimes treated as if it were
a data matrix where either the rows or the columns are treated as if they were variables.