where
$$
n = \sum_{i=1}^{p}\sum_{j=1}^{q} x_{ij},
$$
and, using the weighted identity on the expression in square brackets of (7.23), we have
$$
\chi^2 = n \sum_{i<i'}^{p} \sum_{j=1}^{q} \frac{1}{x_{\cdot j}}\, x_{i\cdot}\, x_{i'\cdot} \left( \frac{x_{ij}}{x_{i\cdot}} - \frac{x_{i'j}}{x_{i'\cdot}} \right)^{2}
= n \sum_{i<i'}^{p} x_{i\cdot}\, x_{i'\cdot} \left[ \sum_{j=1}^{q} \frac{1}{x_{\cdot j}} \left( \frac{x_{ij}}{x_{i\cdot}} - \frac{x_{i'j}}{x_{i'\cdot}} \right)^{2} \right].
\qquad (7.24)
$$
The expression in the square brackets on the right-hand side of (7.24) is the chi-squared distance (7.12) between the $i$th and $i'$th rows of $\mathbf{X}$. Thus we have the simple result that
$$
\chi^2 = n \sum_{i<i'}^{p} x_{i\cdot}\, x_{i'\cdot}\, d_{ii'}^2 = \frac{n}{2}\, \mathbf{1}'\mathbf{R}\mathbf{D}\mathbf{R}\mathbf{1},
\qquad (7.25)
$$
where $\mathbf{D} = \{d_{ii'}^2\}$ is the $p \times p$ matrix of all the row chi-squared distances (7.12). Similarly, for the column chi-squared distances, we have
$$
\chi^2 = n \sum_{j<j'}^{q} x_{\cdot j}\, x_{\cdot j'}\, d_{jj'}^2 = \frac{n}{2}\, \mathbf{1}'\mathbf{C}\mathbf{D}\mathbf{C}\mathbf{1},
$$

where now $\mathbf{D} = \{d_{jj'}^2\}$ is the $q \times q$ matrix of all the column chi-squared distances (7.18). These results link the chi-squared distances to the total Pearson's $\chi^2$ for $\mathbf{X}$.
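These identities can be checked numerically. The sketch below uses an invented contingency table, and it assumes one scaling convention for the $x$'s in (7.24)–(7.25), namely that they denote relative frequencies (the entries of the table divided by $n$); under that convention $(n/2)\,\mathbf{1}'\mathbf{R}\mathbf{D}\mathbf{R}\mathbf{1}$ reproduces Pearson's $\chi^2$ computed directly from the raw counts.

```python
import numpy as np

# Invented 3x4 contingency table of counts (hypothetical data).
X = np.array([[20, 10,  5, 15],
              [10, 25, 10,  5],
              [ 5, 10, 30, 10]], dtype=float)
n = X.sum()

# Pearson's chi-squared computed directly from the counts.
E = np.outer(X.sum(axis=1), X.sum(axis=0)) / n
chi2_direct = ((X - E) ** 2 / E).sum()

# Right-hand side of (7.25): (n/2) * 1' R D R 1, taking the x's as
# relative frequencies X/n (an assumed scaling convention).
P = X / n
r = P.sum(axis=1)              # row margins x_{i.}
c = P.sum(axis=0)              # column margins x_{.j}
profiles = P / r[:, None]      # row profiles x_{ij}/x_{i.}
p_rows = P.shape[0]
D = np.zeros((p_rows, p_rows))
for i in range(p_rows):
    for k in range(p_rows):
        diff = profiles[i] - profiles[k]
        D[i, k] = np.sum(diff ** 2 / c)   # squared chi-squared distance
R = np.diag(r)
one = np.ones(p_rows)
chi2_from_distances = n / 2 * (one @ R @ D @ R @ one)

print(chi2_direct, chi2_from_distances)
```

The two printed values agree, since summing over all ordered pairs in $\mathbf{1}'\mathbf{R}\mathbf{D}\mathbf{R}\mathbf{1}$ counts each unordered pair $i<i'$ twice, which the factor $1/2$ removes.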
7.2.5 Canonical correlation approximation
Probably the oldest derivation of CA is due to Hirschfeld (1935), who asked what quantification of the categorical levels of the two variables classifying the contingency table maximized their correlation. To express this idea algebraically, we define two indicator matrices, $\mathbf{G}_1$ and $\mathbf{G}_2$, of sizes $n \times p$ and $n \times q$ respectively, identifying row and column membership of the $n$ cases.
In terms of our previous notation, we have
$$
\mathbf{X} = \mathbf{G}_1'\mathbf{G}_2, \qquad \mathbf{R} = \mathbf{G}_1'\mathbf{G}_1, \qquad \mathbf{C} = \mathbf{G}_2'\mathbf{G}_2.
$$
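To make the indicator notation concrete, here is a minimal sketch (the six cases and their category labels are invented) that builds $\mathbf{G}_1$ and $\mathbf{G}_2$ and confirms that $\mathbf{G}_1'\mathbf{G}_2$ cross-tabulates the cases while $\mathbf{G}_1'\mathbf{G}_1$ and $\mathbf{G}_2'\mathbf{G}_2$ are diagonal matrices of the margins:

```python
import numpy as np

# Hypothetical categorical data for n = 6 cases: a row variable with
# p = 2 levels and a column variable with q = 3 levels.
rows = np.array([0, 0, 1, 1, 0, 1])
cols = np.array([0, 2, 1, 1, 2, 0])

p, q = 2, 3
G1 = np.eye(p)[rows]   # n x p indicator matrix of row membership
G2 = np.eye(q)[cols]   # n x q indicator matrix of column membership

X = G1.T @ G2          # p x q contingency table of counts
R = G1.T @ G1          # diagonal matrix of row totals
C = G2.T @ G2          # diagonal matrix of column totals

print(X)
print(np.diag(R), np.diag(C))
```

Each row of $\mathbf{G}_1$ (or $\mathbf{G}_2$) has a single 1 marking the category of that case, so the cross-product $\mathbf{G}_1'\mathbf{G}_2$ counts co-occurrences of category pairs.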
Next, we define quantification vectors $\mathbf{z}_1 : p \times 1$ and $\mathbf{z}_2 : q \times 1$, to be determined, which transform the categorical variables into quantitative variables $\mathbf{G}_1\mathbf{z}_1$ and $\mathbf{G}_2\mathbf{z}_2$. These two variables have squared (uncentred) correlation $\rho^2$ given by
$$
\rho^2 = \frac{(\mathbf{z}_1'\mathbf{G}_1'\mathbf{G}_2\mathbf{z}_2)^2}{(\mathbf{z}_1'\mathbf{G}_1'\mathbf{G}_1\mathbf{z}_1)(\mathbf{z}_2'\mathbf{G}_2'\mathbf{G}_2\mathbf{z}_2)}
= \frac{(\mathbf{z}_1'\mathbf{X}\mathbf{z}_2)^2}{(\mathbf{z}_1'\mathbf{R}\mathbf{z}_1)(\mathbf{z}_2'\mathbf{C}\mathbf{z}_2)}.
\qquad (7.26)
$$
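Maximizing (7.26) over $\mathbf{z}_1$ and $\mathbf{z}_2$ can be recast as a singular value decomposition: with $\mathbf{S} = \mathbf{R}^{-1/2}\mathbf{X}\mathbf{C}^{-1/2}$, the singular values of $\mathbf{S}$ are the attainable correlations, the largest being the trivial $\rho = 1$ given by constant quantifications. A sketch under those assumptions (the table is invented, and the SVD route is a standard reformulation rather than necessarily the derivation used here):

```python
import numpy as np

# Invented 3x3 contingency table of counts (hypothetical data).
X = np.array([[20, 10,  5],
              [10, 25, 10],
              [ 5, 10, 30]], dtype=float)
R = np.diag(X.sum(axis=1))      # row-total matrix
C = np.diag(X.sum(axis=0))      # column-total matrix

# Singular values of S = R^{-1/2} X C^{-1/2} are the attainable
# correlations rho; the largest is the trivial rho = 1 (constant
# quantifications), the second is Hirschfeld's maximal correlation.
Rm = np.diag(1.0 / np.sqrt(np.diag(R)))
Cm = np.diag(1.0 / np.sqrt(np.diag(C)))
U, s, Vt = np.linalg.svd(Rm @ X @ Cm)

z1 = Rm @ U[:, 1]               # optimal row quantification
z2 = Cm @ Vt[1, :]              # optimal column quantification

# Plugging z1, z2 into (7.26) recovers rho^2 = s[1]^2.
rho2 = (z1 @ X @ z2) ** 2 / ((z1 @ R @ z1) * (z2 @ C @ z2))
print(s, rho2)
```

The back-transformed singular vectors $\mathbf{z}_1 = \mathbf{R}^{-1/2}\mathbf{u}$ and $\mathbf{z}_2 = \mathbf{C}^{-1/2}\mathbf{v}$ satisfy $\mathbf{z}_1'\mathbf{R}\mathbf{z}_1 = \mathbf{z}_2'\mathbf{C}\mathbf{z}_2 = 1$, so (7.26) reduces to the squared singular value.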