of 0.87 indicates that the model accounts for approximately 87% of the variation in
the response variable, with only 13% of the variability remaining unexplained. The
residuals appear to be randomly distributed around the predicted values (bottom
panel). Plotted against the explanatory variable, with a reference line marking
perfect prediction at any point (residual = 0), the residuals show no pattern: they
are randomly distributed, with no obvious trend left unexplained by the model
along the range of values for the explanatory variable. This will be
explored more closely later. Further, there is no trend in the amount of variation
in the residuals along the abscissa, providing no reason to doubt the assumption of
homoscedasticity. In a normal probability plot (top of Figure 8.1) with the
distribution of points expected for a normal distribution (asterisks) and positions of the
residuals (+ signs), the residuals conform to the assumed normal distribution. Four
tests for normality of the regression residuals also provide no evidence of deviation
from this assumption (top right of Figure 8.1).
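The residual checks described above can be sketched in a few lines of Python. This is a minimal illustration and not the SAS analysis behind Figure 8.1: the data are invented, the helper names (`fit_line`, `pearson_r`, `normal_plot_correlation`) are hypothetical, and the normal probability plot is summarized by the correlation between ordered residuals and expected normal quantiles, using Blom plotting positions as an assumed choice.

```python
from statistics import NormalDist, mean, stdev


def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


def pearson_r(a, b):
    """Pearson correlation coefficient between two equal-length sequences."""
    a_bar, b_bar = mean(a), mean(b)
    num = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    den = (sum((ai - a_bar) ** 2 for ai in a)
           * sum((bi - b_bar) ** 2 for bi in b)) ** 0.5
    return num / den


def normal_plot_correlation(residuals):
    """Correlation of ordered residuals with expected normal quantiles.

    Values near 1 mean the points of a normal probability plot fall
    close to a straight line. Blom plotting positions are an assumption.
    """
    n = len(residuals)
    expected = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25))
                for i in range(1, n + 1)]
    return pearson_r(sorted(residuals), expected)


# Invented illustrative data (NOT from the text), already sorted by x.
x = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 5.8, 7.2, 7.9]

b0, b1 = fit_line(x, y)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]

# Crude homoscedasticity check: residual spread in the upper half of x
# relative to the lower half (a ratio far from 1 suggests a trend in variance).
half = len(x) // 2
spread_ratio = stdev(residuals[half:]) / stdev(residuals[:half])

ppcc = normal_plot_correlation(residuals)
print(f"slope={b1:.3f}  spread ratio={spread_ratio:.2f}  plot correlation={ppcc:.3f}")
```

A spread ratio near 1 and a plot correlation near 1 are consistent with, but do not prove, homoscedastic and normally distributed residuals; formal tests such as those reported in Figure 8.1 remain the stronger evidence.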
The next task required in QSAR development is selection of the best model.
Several approaches are used, ranging from the statistically uninformed judgment of
the researcher to MAICE and Mallows's C_p method. Model selection guided solely
by the researcher's informal judgment can produce the best model, but consistency
of good judgment is enhanced by application of more formal methods. For example,
model selection from among candidate models based only on the smallest χ² value
will not always produce the best model,

$$\chi^2 = \sum_{i=1}^{n} \frac{(Y_i - \hat{Y}_i)^2}{\hat{Y}_i}. \tag{8.2}$$
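Equation (8.2) is simple to compute directly. The sketch below assumes that Y_i is the observed response and Ŷ_i is the model prediction for observation i (this section does not define them), and it compares two invented sets of predictions. The function name `chi_square` and all numbers are hypothetical.

```python
def chi_square(observed, predicted):
    """Sum of (Y_i - Yhat_i)^2 / Yhat_i over all observations, per Equation (8.2).

    Assumes every predicted value is nonzero, because each term divides by it.
    """
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    return sum((y - y_hat) ** 2 / y_hat for y, y_hat in zip(observed, predicted))


# Invented illustrative values (NOT from the text).
observed = [12.0, 18.0, 31.0]
model_a = [10.0, 20.0, 30.0]   # predictions from a hypothetical model A
model_b = [11.0, 19.0, 33.0]   # predictions from a hypothetical model B

chi_a = chi_square(observed, model_a)
chi_b = chi_square(observed, model_b)
print(f"model A: {chi_a:.4f}   model B: {chi_b:.4f}")
```

The smaller value identifies the closer fit only; it says nothing about how many parameters each model needed to achieve that fit, which is the limitation the text goes on to discuss.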
Use of coefficients of determination or χ² values for two models of similar complexity
might be adequate if combined with subject knowledge and the residual plots just
described. As an example, such use would be appropriate if the above SAS code fitting
the [ln EC50_i, σ_con] data were also modified to assess the alternative model generated
with the more conventional softness index, σ_p. The r² for σ_con was 0.87 and that for σ_p
was 0.81, lending support to Kinraide's (2009) argument that σ_con will perform better
than the conventional σ_p during model generation. But the model with the most information
per fitted explanatory variable cannot be identified with these otherwise useful
goodness-of-fit statistics. The r² will increase with each addition of an explanatory
variable, but the incremental improvement in fit might carry the cost of increased variance
in parameter estimates (Hocking 1976). A straightforward change can be made to
Equation (8.1) to generate an adjusted r² that incorporates the number of explanatory
parameters and model degrees of freedom (Hocking 1976; Walker et al. 2003),

$$\text{Adjusted } r^2 = 1 - \frac{(n - 1)(1 - r^2)}{n - p} \tag{8.3}$$

where n = the number of observations, r² = the coefficient of determination, and p = the
number of estimated parameters.
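A short sketch shows how the adjustment penalizes an extra parameter. It assumes the adjusted coefficient takes the form 1 − (n − 1)(1 − r²)/(n − p), with p counting every estimated parameter including the intercept. The sample size and r² values are invented for illustration and are not results from the text.

```python
def adjusted_r2(r2, n, p):
    """Adjusted r^2 = 1 - (n - 1) * (1 - r^2) / (n - p).

    r2: coefficient of determination
    n:  number of observations
    p:  number of estimated parameters (assumed to include the intercept)
    """
    if n <= p:
        raise ValueError("need more observations than estimated parameters")
    return 1.0 - (n - 1) * (1.0 - r2) / (n - p)


# Hypothetical comparison (values are NOT from the text): adding a third
# parameter raises r^2 only slightly, from 0.800 to 0.805, with n = 20.
n = 20
simple = adjusted_r2(0.800, n, p=2)
larger = adjusted_r2(0.805, n, p=3)
print(f"two parameters: {simple:.4f}   three parameters: {larger:.4f}")
```

In this invented case the raw r² rises with the added parameter while the adjusted value falls, which is the behavior that makes the adjusted statistic more useful when candidate models differ in complexity.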