F statistics (Draper and Smith 1988). The computed statistic is compared to some
arbitrary threshold for model predictive usefulness. Alone, a statistically significant
F statistic estimated for a regression model provides only limited insight
about how useful any predictions from the model might be; however, an F statistic
exceeding F(df_m, df_r, 0.95) by a preestablished magnitude has served as a tool
for separating useful from nonuseful models. The general rule of thumb,
(F statistic)/F(df_m, df_r, 0.95) ≥ 4 to 5, is often applied for this purpose, as detailed in Draper and
Smith (1988).* For example, the estimated F statistic for the above model predicting
bacterial bioluminescence inhibition based on the metal ions' softness (σ_con), ionic,
and covalence indices was 52.83, with an associated critical F(3, 16, 0.95) of 3.24. The
resulting (F statistic)/F(3, 16, 0.95) = 52.83/3.24 = 16.31 is much greater than 5, suggesting
that the model would be a useful one for prediction.
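As an illustration, this rule-of-thumb check can be reproduced in SAS with the FINV quantile function; the short data step below (not part of the original text) assumes the F statistic of 52.83 and the 3 model and 16 residual degrees of freedom from the example above.

   data _null_;
      /* critical F at 1 - alpha = 0.95 with df_m = 3 and df_r = 16 */
      fcrit = finv(0.95, 3, 16);
      /* rule-of-thumb ratio: model deemed useful if ratio >= 4 to 5 */
      ratio = 52.83 / fcrit;
      put fcrit= 6.2 ratio= 6.2;
   run;

Running this prints fcrit=3.24 and ratio=16.31, matching the hand calculation above.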
Two cross-validation methods provide a better approach than the one just described,
although generally not as good a one as the validation method. The first involves splitting
the available data into two subsets, and the second involves removing one datum at a
time from the data set before building models.
If large enough, a data set can be split into two subsets called the training and validation
sets. This procedure simulates the validation technique by producing a data
set not used to build the original model. The disadvantage of this approach is that
not all available data are used to generate the model. As a general rule, the number
of observations should be at least 6 to 10 times the number of explanatory
variables in order to apply this approach successfully (Neter et al. 1990). Individual
observations can be randomly split between the training and validation sets, but in
some cases, a completely random assignment might not be the best approach. For
example, it might be preferable to randomly pick observations from within regions
along a gradient for some explanatory variable. This ensures that both the training
and validation data sets have observations representing all relevant regions
along the gradient.
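As a sketch of such a gradient-based split (not from the original text), SAS's PROC SURVEYSELECT can draw a fixed fraction of observations within each region; the data set and variable names below (METALS, REGION) are hypothetical.

   proc surveyselect data=metals out=split samprate=0.5
                     seed=20831 outall;
      strata region;   /* sample within each region along the gradient */
   run;
   /* OUTALL keeps every observation and adds a Selected flag:    */
   /* Selected = 1 -> training set, Selected = 0 -> validation set */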
If the data set (of size n) is small, one observation can be removed at a time: a
model is generated from the remaining n − 1 observations, a prediction is made with
that model for the removed observation, and the difference between the observed and
predicted values is calculated. The removed datum is then returned to the data set,
another datum is removed, and the process is repeated until n models have been built,
each with a different observation withheld for prediction. Analysis of the n differences
between the observed and predicted values (the prediction residuals) suggests how
useful predictions from a model will be.
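The text does not name the summary statistic it has in mind, but one standard choice for leave-one-out prediction residuals is the PRESS (prediction sum of squares) statistic, where \hat{y}_{(i)} denotes the prediction for observation i from the model fitted to the other n − 1 observations:

   \mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2

A PRESS value close to the ordinary residual sum of squares indicates that the model predicts withheld observations about as well as it fits the data used to estimate it.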
Prediction residuals can be examined directly, or a summary statistic such as
PRESS can be generated from them. The following SAS code generates individual
prediction residuals and also produces a summary statistic for the bacterial
bioluminescence data set listed in Appendix 8.1. Figure 8.3 suggests good prediction
(top panel) and no apparent trend in the prediction residuals with the predicted ln of
the EC50 (bottom panel).
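A minimal sketch of such code, assuming the Appendix 8.1 data are stored in a set named BACTERIA with hypothetical variable names LNEC50, SOFTCON, IONINDEX, and COVINDEX, might look like this:

   proc reg data=bacteria;
      model lnec50 = softcon ionindex covindex / p;  /* P prints fit diagnostics,
                                                        including the PRESS summary */
      output out=loo predicted=yhat press=presid;    /* PRESS= saves each
                                                        leave-one-out prediction residual */
   run;

   proc means data=loo n uss;
      var presid;   /* uncorrected SS of the PRESS residuals gives the PRESS statistic */
   run;

PROC REG computes each PRESS residual as the ordinary residual divided by 1 − h_i (the leverage), which equals y_i − \hat{y}_{(i)} without actually refitting the model n times.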
* The df_m = model degrees of freedom (the number of estimated parameters minus 1), df_r = the residual degrees of freedom, and 0.95 = 1 − α.