The method of least squares may be used to estimate the regression coefficients (k_1-k_4) for the independent variables [(log P)², log P, σ, and E_s]; these, together with the value of the constant term (k_5), are determined by computer to obtain the best-fitting line. The accuracy of fit of the equation to the data can be assessed by calculating the squared correlation coefficient (r²). If the regression equation is a perfect fit to the data (r² = 1), then a plot of predicted versus observed values should give a straight line with a slope of one and an intercept of zero.
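As a concrete sketch of such a fit, the following code estimates k_1 to k_5 by ordinary least squares with NumPy and computes r²; the descriptor and activity values are entirely hypothetical.

```python
# A minimal least-squares sketch of a Hansch-type fit; all data are
# hypothetical placeholders, not values from the text.
import numpy as np

logP = np.array([0.5, 1.2, 1.8, 2.3, 2.9, 3.5])
sigma = np.array([-0.17, 0.0, 0.23, 0.37, 0.06, 0.45])
Es = np.array([0.0, -0.55, -1.24, -0.47, -1.54, -0.46])
activity = np.array([2.1, 2.9, 3.4, 3.6, 3.2, 2.8])  # e.g. log(1/C)

# Design matrix: (log P)^2, log P, sigma, Es, and a column of ones
# for the constant term k5.
X = np.column_stack([logP**2, logP, sigma, Es, np.ones_like(logP)])

# Least-squares estimates of k1..k4 and the constant k5.
coeffs, *_ = np.linalg.lstsq(X, activity, rcond=None)

predicted = X @ coeffs
residual_ss = np.sum((activity - predicted) ** 2)
total_ss = np.sum((activity - activity.mean()) ** 2)
r2 = 1 - residual_ss / total_ss  # squared correlation coefficient
print(coeffs, r2)
```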
A guide to the overall significance of a regression model can be obtained by calculating a quantity called the F statistic. The F statistic is used by looking up a standard value of F in a table of the F distribution and comparing the calculated value with the tabulated one. If the calculated value is greater than the tabulated value, the equation is significant at that particular confidence level; only F values larger than the 95% significance limit establish the overall significance of a regression equation. As might be expected, the tabulated F values are greater for higher levels of significance.
F_{k,\,n-k-1} = \frac{r^2\,(n - k - 1)}{k\,(1 - r^2)}
(13)
where k is the number of independent variables in the equation and n is the number of data points. An F statistic is usually quoted as F_{k, n-k-1}.
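Equation (13) can be evaluated directly; a minimal sketch, with hypothetical values of r², k, and n, compares the calculated F with the tabulated 95% value obtained from scipy.stats:

```python
# F test per Eq. (13); r2, k, and n are hypothetical illustrative values.
from scipy.stats import f

r2 = 0.85   # squared correlation coefficient of the fitted model
k = 4       # number of independent variables
n = 20      # number of data points

F_calc = (r2 * (n - k - 1)) / (k * (1 - r2))
F_crit = f.ppf(0.95, k, n - k - 1)  # 95% point of F with (k, n-k-1) d.f.

print(F_calc, F_crit, F_calc > F_crit)  # True -> significant at the 95% level
```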
The significance of the individual terms is assessed by calculating the standard error of each regression coefficient, a measure of how reliably that term contributes to the prediction of the dependent variable. A statistic, the t statistic, may be calculated for each regression coefficient by dividing the coefficient by its standard error (SE):
t = \frac{|b|}{\mathrm{SE}(b)}
(14)
where the absolute value of the coefficient is used, so that t is always positive.
Like the F statistic, the significance of a t statistic is assessed by looking up a standard value in a table; the calculated value should exceed the tabulated one. As a rule of thumb, a regression coefficient should be at least twice as large as its standard error if it is to be considered significant. Tables of the t and F distributions can be found in [17].
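A minimal sketch of this test for a single coefficient, using hypothetical values for b and its standard error and the two-sided 95% point of the t distribution from scipy.stats:

```python
# t test per Eq. (14) for one regression coefficient; b and se_b are
# hypothetical illustrative values.
from scipy.stats import t as t_dist

b = 0.92          # regression coefficient
se_b = 0.31       # its standard error
n, k = 20, 4      # data points, independent variables

t_calc = abs(b) / se_b                 # Eq. (14), absolute value
t_crit = t_dist.ppf(0.975, n - k - 1)  # two-sided 95% critical value

print(t_calc, t_crit, t_calc > t_crit)
# Rule of thumb: t_calc >= 2, i.e. the coefficient is at least
# twice as large as its standard error.
```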
Another useful statistic that can be calculated to characterize the fit of a regression model to a set of data is the standard error of prediction. This gives a measure of how well one might expect to be able to make individual predictions. The model will be useful if the standard error of prediction is less than ten per cent of the range of the measurements. The probability level is invariably taken as 0.05 in QSAR.
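Assuming the usual standard error of estimate, s = sqrt(RSS/(n - k - 1)), as the definition of the prediction standard error (the text gives no explicit formula), the ten-per-cent check might look as follows; all values are hypothetical.

```python
# Ten-per-cent usefulness check; the definition of s is an assumption,
# and the observed/predicted values are hypothetical.
import numpy as np

observed = np.array([2.1, 2.9, 3.4, 3.6, 3.2, 2.8])
predicted = np.array([2.0, 3.0, 3.3, 3.7, 3.1, 2.9])
k = 2  # number of independent variables used in the model

rss = np.sum((observed - predicted) ** 2)
s = np.sqrt(rss / (len(observed) - k - 1))  # standard error of estimate

data_range = observed.max() - observed.min()
print(s, data_range, s < 0.10 * data_range)  # True -> model likely useful
```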
Modern validation techniques include bootstrapping and cross-validation. Cross-validation (CV) evaluates a model not by how well it fits the data but by how well it predicts data. The data set of n compounds is divided into groups. Leaving out one group according to a fixed or random pattern, the multiple linear regression is recalculated for the reduced data set and the missing values are predicted. This is repeated until every compound has been left out once and only once. When only one compound is left out at a time, the procedure is referred to as the leave-one-out (LOO) method. A common recommendation is to divide the data set into seven groups. From the predicted values, PRESS (the predictive residual sum of squares) and SD are obtained as:
\mathrm{PRESS} = \sum \left(\text{property}_{\text{observed}} - \text{property}_{\text{predicted}}\right)^2
(15)
\mathrm{SD} = \sum \left(\text{property}_{\text{observed}} - \text{property}_{\text{mean}}\right)^2
(16)
and the cross-validated correlation coefficient is calculated as:
q^2 = r^2_{\mathrm{CV}} = \frac{\mathrm{SD} - \mathrm{PRESS}}{\mathrm{SD}}
(17)
q² will always be smaller than r². When q² > 0.3, a model is considered significant [18].
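The LOO procedure and Eqs. (15)-(17) can be sketched in a few lines of NumPy; the one-descriptor data set below is hypothetical.

```python
# Leave-one-out cross-validation per Eqs. (15)-(17); the descriptor and
# property values are hypothetical.
import numpy as np

# Design matrix with one descriptor plus a constant column.
X = np.column_stack([np.linspace(0.5, 3.5, 8), np.ones(8)])
y = np.array([2.1, 2.4, 2.7, 2.9, 3.1, 3.3, 3.4, 3.6])  # observed property

press = 0.0
for i in range(len(y)):
    mask = np.arange(len(y)) != i    # leave compound i out
    coeffs, *_ = np.linalg.lstsq(X[mask], y[mask], rcond=None)
    y_pred = X[i] @ coeffs           # predict the omitted value
    press += (y[i] - y_pred) ** 2    # Eq. (15)

sd = np.sum((y - y.mean()) ** 2)     # Eq. (16)
q2 = (sd - press) / sd               # Eq. (17)
print(press, sd, q2)                 # q2 is smaller than r2, as expected
```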
To avoid statistically non-significant relationships or chance correlations, one should always apply the following
rules of thumb [18]: