F statistics (Draper and Smith 1988). The computed statistic is compared to some
arbitrary threshold for model predictive usefulness. Alone, a statistically significant
F statistic estimated for a regression model provides only limited insight
about how useful any predictions from the model might be; however, an F statistic
exceeding F(df_m, df_r, 0.95) by a preestablished magnitude has served as a tool
for separating useful from nonuseful models. The general rule of thumb,
(F statistic)/F(df_m, df_r, 0.95) ≥ 4 to 5, is often applied for this purpose, as detailed in Draper and
Smith (1988).* For example, the estimated F statistic for the above model predicting
bacterial bioluminescence inhibition based on the metal ions' softness (σ_con), ionic,
and covalence indices was 52.83, with an associated critical F(3, 16, 0.95) of 3.24. The
resulting (F statistic)/F(3, 16, 0.95) = 52.83/3.24 = 16.31 is much greater than 5, suggesting
that the model would be a useful one for prediction.
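As an illustration, this rule-of-thumb check can be reproduced in SAS with the FINV quantile function; the short data step below (not part of the original text) assumes the F statistic of 52.83 and the 3 model and 16 residual degrees of freedom from the example above.

   data _null_;
      /* critical F at 1 - alpha = 0.95 with df_m = 3 and df_r = 16 */
      fcrit = finv(0.95, 3, 16);
      /* rule-of-thumb ratio: model deemed useful if ratio >= 4 to 5 */
      ratio = 52.83 / fcrit;
      put fcrit= 6.2 ratio= 6.2;
   run;

Running this prints fcrit=3.24 and ratio=16.31, matching the hand calculation above.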
Two cross-validation methods provide a better approach than the one just described,
although generally not as good a one as the validation method. The first involves splitting
the available data into two subsets, and the second involves removing one datum at a
time from the data set before building models.
If large enough, a data set can be split into two subsets called the training and validation
sets. This procedure simulates the validation technique by producing a data
set not used to build the original model. The disadvantage of this approach is that
not all available data are used to generate the model. As a general rule, the number
of observations should be at least 6 to 10 times the number of explanatory
variables in order to apply this approach successfully (Neter et al. 1990). Individual
observations can be randomly split between the training and validation sets, but in
some cases, a completely random assignment might not be the best approach. For
example, it might be preferable to randomly pick observations from within regions
along a gradient for some explanatory variable. This ensures that both the training
and validation data sets have observations representing all relevant regions
along the gradient.
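As a sketch of such a gradient-based split (not from the original text), SAS's PROC SURVEYSELECT can draw a fixed fraction of observations within each region; the data set and variable names below (METALS, REGION) are hypothetical.

   proc surveyselect data=metals out=split samprate=0.5
                     seed=20831 outall;
      strata region;   /* sample within each region along the gradient */
   run;
   /* OUTALL keeps every observation and adds a Selected flag:    */
   /* Selected = 1 -> training set, Selected = 0 -> validation set */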
If the data set (of size n) is small, one observation can be removed at a time: a
model is generated from the remaining n − 1 observations, a prediction is made with
that model for the removed observation, and the difference between the observed and
predicted values is calculated. The removed datum is then returned to the data set,
another datum is removed, and the process is repeated until n models have been built,
each with a different observation withheld for prediction. Analysis of the n differences
between the observed and predicted values (the prediction residuals) suggests how
useful predictions from a model will be.
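The text does not name the summary statistic it has in mind, but one standard choice for leave-one-out prediction residuals is the PRESS (prediction sum of squares) statistic, where \hat{y}_{(i)} denotes the prediction for observation i from the model fitted to the other n − 1 observations:

   \mathrm{PRESS} = \sum_{i=1}^{n} \left( y_i - \hat{y}_{(i)} \right)^2

A PRESS value close to the ordinary residual sum of squares indicates that the model predicts withheld observations about as well as it fits the data used to estimate it.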
Prediction residuals can be examined directly, or a summary statistic such as
PRESS can be generated from them. The following SAS code generates individual
prediction residuals and also produces a summary statistic for the bacterial
bioluminescence data set listed in Appendix 8.1. Figure 8.3 suggests good prediction
(top panel) and no apparent trend in the prediction residuals with the predicted ln of
the EC50 (bottom panel).
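A minimal sketch of such code, assuming the Appendix 8.1 data are stored in a set named BACTERIA with hypothetical variable names LNEC50, SOFTCON, IONINDEX, and COVINDEX, might look like this:

   proc reg data=bacteria;
      model lnec50 = softcon ionindex covindex / p;  /* P prints fit diagnostics,
                                                        including the PRESS summary */
      output out=loo predicted=yhat press=presid;    /* PRESS= saves each
                                                        leave-one-out prediction residual */
   run;

   proc means data=loo n uss;
      var presid;   /* uncorrected SS of the PRESS residuals gives the PRESS statistic */
   run;

PROC REG computes each PRESS residual as the ordinary residual divided by 1 − h_i (the leverage), which equals y_i − \hat{y}_{(i)} without actually refitting the model n times.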
* The df_m = model degrees of freedom (the number of estimated parameters minus 1), df_r = the residual degrees of freedom, and 0.95 = 1 − α.