Y contains the deviations of each individual from the overall mean. B is the matrix of
coefficients of the model, which will be fitted to the data, X is the centered design matrix
and ε is the matrix of residuals or error terms. If Y is a matrix with N rows (one per
specimen) and Q columns, the matrix of residuals, ε, will also have N rows and Q columns.
The size of the design matrix, X, C × N, depends on the design, i.e. on the number of
factors, the number of levels of each factor and the number of distinct combinations of
factors in any interaction terms, as well as the number of interactions with covariates. It
can also depend on how the model is coded because the number of columns, ignoring
interaction terms for the moment, could either equal G − 1 (where G is the number of
groups) or G. It takes G − 1 columns to specify the design, so using G columns makes the
coding scheme redundant (and the X matrix is then not invertible). We will therefore focus
on design matrices that have G − 1 columns.
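As a rough illustration of the redundancy, the following Python sketch builds both versions of the design matrix for a made-up data set (three groups, two specimens each) and compares their ranks. It assumes that specimens are arranged in rows and that a column of ones for the intercept accompanies the group codes; the `rank` helper and all variable names are hypothetical, not part of any package. Reading "not invertible" as a statement about X'X is also an interpretation on our part.

```python
from fractions import Fraction


def rank(matrix):
    """Rank of a matrix (list of rows) by exact Gaussian elimination."""
    rows = [[Fraction(v) for v in row] for row in matrix]
    r = 0  # number of pivots found so far
    for col in range(len(rows[0])):
        # Look for a row at or below position r with a nonzero entry in this column.
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(r + 1, len(rows)):
            factor = rows[i][col] / rows[r][col]
            rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


# Hypothetical design: G = 3 groups, two specimens per group (N = 6).
groups = [0, 0, 1, 1, 2, 2]
G = 3

# Intercept plus G indicator columns (redundant coding).
X_redundant = [[1] + [int(g == k) for k in range(G)] for g in groups]

# Intercept plus G - 1 indicator columns (non-redundant coding).
X_reduced = [[1] + [int(g == k) for k in range(G - 1)] for g in groups]

rank_redundant = rank(X_redundant)
rank_reduced = rank(X_reduced)

print("redundant:", len(X_redundant[0]), "columns, rank", rank_redundant)
print("reduced:  ", len(X_reduced[0]), "columns, rank", rank_reduced)
```

Under these assumptions the redundant version has four columns but only rank three, so X'X cannot be inverted, whereas the version with G − 1 group columns has three columns and rank three.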
To understand the codes, it is important to remember that we are using regression to
analyze categorical factors. We therefore need values for the categorical factors that make
the results of the regression interpretable. One coding method is called “dummy coding”.
According to this method, all individuals are coded as either zero or one to indicate each
individual's level on each categorical factor, including all interaction terms. Which group
is coded as zero or one is arbitrary, but the interpretation depends on the codes because
the intercept is the mean of the group coded as zero. Usually, the control group is the one
coded zero and the null hypothesis is that the means of the other groups do not differ
from the mean of the control group. The coefficients for the other groups give the devia-
tions from the control group mean. If there are three groups, it takes two columns to
encode a single factor; all individuals belonging to the first group will have ones in the
first column and zeros in the second, all individuals belonging to the second group will
have zeros in the first column and ones in the second and all individuals belonging to the
third group will have zeros in both columns. To obtain the codes for the interaction terms,
the columns of codes for the factors are multiplied by each other. Coding can become
complex when factors are nested, so the X matrices for these more complex designs are
discussed later in the context of the more complex models.
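The interpretation of dummy codes can be checked on a small example. The sketch below uses invented measurements for three groups of two specimens, codes them as described above (the third group is coded zero in both columns), and solves the normal equations exactly. The helper `least_squares`, the data and the variable names are all hypothetical; this is one simple way to fit the model, not a recommended implementation.

```python
from fractions import Fraction


def least_squares(X, y):
    """Solve the normal equations (X'X) b = X'y exactly by Gauss-Jordan elimination.

    Assumes X has full column rank, so a nonzero pivot always exists.
    """
    p = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [[Fraction(sum(row[i] * row[j] for row in X)) for j in range(p)]
         + [Fraction(sum(row[i] * yi for row, yi in zip(X, y)))]
         for i in range(p)]
    for col in range(p):
        pivot = next(i for i in range(col, p) if A[i][col] != 0)
        A[col], A[pivot] = A[pivot], A[col]
        A[col] = [v / A[col][col] for v in A[col]]
        for i in range(p):
            if i != col:
                factor = A[i][col]
                A[i] = [a - factor * b for a, b in zip(A[i], A[col])]
    return [row[-1] for row in A]


# Invented data: group means are 5 (group 1), 8 (group 2) and 2 (group 3).
groups = [1, 1, 2, 2, 3, 3]
y = [4, 6, 7, 9, 1, 3]

# Dummy codes: group 1 -> (1, 0), group 2 -> (0, 1), group 3 -> (0, 0).
codes = {1: [1, 0], 2: [0, 1], 3: [0, 0]}
X = [[1] + codes[g] for g in groups]  # the leading 1 is the intercept column

coefficients = least_squares(X, y)
print([float(b) for b in coefficients])

# Interaction codes for two hypothetical two-level factors:
# multiply the columns of codes element by element.
factor_a = [0, 0, 1, 1]
factor_b = [0, 1, 0, 1]
interaction = [a * b for a, b in zip(factor_a, factor_b)]
print(interaction)
```

With these numbers the intercept equals the mean of the group coded zero, and the two coefficients equal the differences between the other two group means and that mean. The interaction column is one only for individuals coded one on both factors.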
An alternative coding method is called “effect coding”; according to this method, all
individuals are coded as negative one, zero, or positive one. If there are only two groups,
the first one is coded as −1, the other as 1, and if the design is balanced, the mean for the
column is zero. If there are three groups, then in the first column the first group is coded
as −1, the second as 0, and the last as 1; in the second column, the first group is coded as
0, the second as −1, and the third as 1. Using this method, the intercept is the grand mean,
the coefficient for X₁ gives the deviation of the first group from that mean, the coefficient
for X₂ gives the deviation of the second group from that same mean, etc. So, when testing
the statistical significance of the coefficients for X, we are testing the null hypothesis
that one group does not differ from the grand mean by more than
expected by chance. As mentioned above, the codes for interaction terms are obtained by
multiplying the columns for the interacting factors.
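A small numerical check of effect coding, again with invented data, is sketched below. It uses the three-group codes given above and relies on a standard property of one-factor models: when the coding has full rank, the least-squares fitted value for each specimen is its group mean, so coefficients that reproduce the group means are the least-squares coefficients. The data and variable names are hypothetical. Note that with −1 marking a group in its own column, each coefficient has the size of that group's deviation from the grand mean but the opposite sign to (group mean − grand mean); reversing the signs of the codes reverses the sign of the coefficient.

```python
from fractions import Fraction

# Invented balanced data: group means are 4 (group 1), 9 (group 2) and 2 (group 3).
groups = [1, 1, 2, 2, 3, 3]
y = [3, 5, 8, 10, 1, 3]

# Effect codes as described in the text:
#   column 1: group 1 -> -1, group 2 ->  0, group 3 -> 1
#   column 2: group 1 ->  0, group 2 -> -1, group 3 -> 1
codes = {1: (-1, 0), 2: (0, -1), 3: (1, 1)}
X = [(1,) + codes[g] for g in groups]  # the leading 1 is the intercept column

# In a balanced design each column of codes averages to zero.
column_means = [Fraction(sum(row[j] for row in X), len(X)) for j in (1, 2)]

group_means = {
    g: Fraction(sum(v for v, h in zip(y, groups) if h == g), groups.count(g))
    for g in codes
}
grand_mean = Fraction(sum(y), len(y))

# Candidate coefficients: intercept = grand mean; each slope is the grand mean
# minus the mean of the group that is coded -1 in that column.
b0 = grand_mean
b1 = grand_mean - group_means[1]
b2 = grand_mean - group_means[2]

# Fitted values X b; these should equal each specimen's group mean.
fitted = [b0 * r[0] + b1 * r[1] + b2 * r[2] for r in X]

print("column means:", [float(m) for m in column_means])
print("intercept:", float(b0), "coefficients:", float(b1), float(b2))
print("fitted:", [float(f) for f in fitted])
```

Because the design is balanced, the grand mean of the observations coincides with the mean of the three group means, which is what the intercept estimates under this coding.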
Coding is more difficult when the design is unbalanced for a reason that may become
obvious if you consider that the grand mean will not be zero when there are different
numbers of positive and negative ones. The codes will therefore have to be modified to
ensure that the grand mean is still zero and that the columns of X are mutually orthogo-
nal. One approach is to code the first group as (N
2
2
n i )/N where n i is the number of