The first of these aims is appropriate if the loss function is mean-square
error.[4] The second creates a strong risk of overfitting. Validation is essential,
yet most of the methods discussed in Chapter 11 do not apply. Validation
via a completely independent data set cannot provide confirmation,
because the new data would entail the production of a completely different,
unrelated curve. The only effective method of validation is to divide
the data set in half at random, fit a curve to one of the halves, and then
assess its fit against the entire data set.
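The split-half check just described can be sketched in a few lines. This is
only an illustration under stated assumptions: the curve is taken to be a
polynomial, the measure of fit is mean-square error, and the function name
split_half_validation, the degree argument, and the demonstration data are
all hypothetical rather than drawn from the text.

```python
import numpy as np

def split_half_validation(x, y, degree=2, seed=0):
    """Fit a polynomial to a random half of the data, then score it on all of it.

    The polynomial form and the mean-square-error score are assumptions
    made for this sketch; substitute the curve and loss that suit your aims.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    rng = np.random.default_rng(seed)

    # Divide the data set in half at random.
    order = rng.permutation(len(x))
    fit_idx = order[: len(x) // 2]

    # Fit the curve to one of the halves only.
    coefs = np.polyfit(x[fit_idx], y[fit_idx], degree)

    # Assess the fit against the entire data set.
    predicted = np.polyval(coefs, x)
    mse_half = float(np.mean((y[fit_idx] - predicted[fit_idx]) ** 2))
    mse_full = float(np.mean((y - predicted) ** 2))
    return coefs, mse_half, mse_full

# Hypothetical demonstration data: a straight line plus noise.
demo_rng = np.random.default_rng(1)
x_demo = np.linspace(0.0, 10.0, 40)
y_demo = 1.0 + 0.5 * x_demo + demo_rng.normal(scale=0.3, size=x_demo.size)
coefs_demo, mse_half_demo, mse_full_demo = split_half_validation(
    x_demo, y_demo, degree=1
)
```

A full-data error much larger than the half-data error would suggest that
the curve has been fitted to the peculiarities of one half rather than to
the relationship itself.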
SUMMARY
Regression methods work well with physical models. The relevant variables
are known and so are the functional forms of the equations connecting
them. Measurement can be done to high precision, and much is known
about the nature of the errors—in the measurements and in the equations.
Furthermore, there is ample opportunity for comparing predictions
to reality.
Regression methods can be less successful for biological and social
science applications. Before undertaking a univariate regression, you
should have a fairly clear idea of the mechanistic nature of the relationship
(and thus the form the regression function will take). Look for deviations
from the model particularly at the extremes of the variable range. A plot
of the residuals can be helpful in this regard; see, for example, Davison
and Snell [1991] and Hardin and Hilbe [2003, pp. 143-159].
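As a numerical companion to a residual plot, one might compare the average
residual at the two extremes of the variable range with that in the middle.
The sketch below assumes a straight-line model and an arbitrary 10% cutoff
for each tail; the function name residuals_by_region and the tail_fraction
parameter are hypothetical choices made for illustration.

```python
import numpy as np

def residuals_by_region(x, y, tail_fraction=0.1):
    """Mean residual from a straight-line fit in the low, middle, and high
    regions of x. The linear model and the tail cutoff are assumptions."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Least-squares straight line through all of the data.
    slope, intercept = np.polyfit(x, y, 1)
    resid = y - (slope * x + intercept)

    # Boundaries of the lower and upper tails of the variable range.
    lo, hi = np.quantile(x, [tail_fraction, 1.0 - tail_fraction])

    return {
        "low": float(resid[x <= lo].mean()),
        "middle": float(resid[(x > lo) & (x < hi)].mean()),
        "high": float(resid[x >= hi].mean()),
    }

# Hypothetical data with curvature that a straight line cannot capture.
x_curved = np.linspace(-1.0, 1.0, 101)
y_curved = x_curved ** 2
regions = residuals_by_region(x_curved, y_curved)
```

With curved data such as these, the residuals are positive in both tails and
negative in the middle, a pattern that signals the wrong functional form.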
A preliminary multivariate analysis (the topic of the next two chapters)
will give you a fairly clear notion of which variables are likely to be
confounded so that you can correct for them by stratification. Stratification
will also allow you to take advantage of permutation methods, which are
preferable in instances where “errors” or model residuals are unlikely
to follow a normal distribution.
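One way a stratified permutation test might look is sketched below. The
observations are rearranged only within their own stratum, so that the
stratifying variable cannot account for any association that remains. The
test statistic (a within-stratum sum of cross-products), the two-sided
p-value, and the name stratified_permutation_test are assumptions made for
this illustration, not a procedure prescribed in the text.

```python
import numpy as np

def stratified_permutation_test(x, y, strata, n_perm=2000, seed=0):
    """Permutation test of association between x and y within strata.

    Returns the observed statistic and a two-sided Monte Carlo p-value.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    strata = np.asarray(strata)
    rng = np.random.default_rng(seed)

    # Index sets, one per stratum.
    groups = [np.flatnonzero(strata == s) for s in np.unique(strata)]

    def statistic(y_values):
        # Sum over strata of centered cross-products of x and y.
        total = 0.0
        for idx in groups:
            xc = x[idx] - x[idx].mean()
            yc = y_values[idx] - y_values[idx].mean()
            total += float(np.sum(xc * yc))
        return total

    observed = statistic(y)
    exceed = 0
    for _ in range(n_perm):
        y_perm = y.copy()
        for idx in groups:
            # Shuffle y only among observations in the same stratum.
            y_perm[idx] = rng.permutation(y[idx])
        # The small tolerance guards against floating-point ties.
        if abs(statistic(y_perm)) >= abs(observed) - 1e-12:
            exceed += 1

    # Count the observed arrangement itself so that p is never zero.
    p_value = (exceed + 1) / (n_perm + 1)
    return observed, p_value
```

No assumption of normality enters the calculation; the reference
distribution comes entirely from the rearrangements within strata.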
It's also essential that you have firmly in mind the objectives of your
analysis, and the losses associated with potential decisions, so that you can
adopt the appropriate measure of goodness of fit. The results of a regression
analysis should be treated with care; as Freedman [1999] notes,
“Even if significance can be determined and the null hypothesis rejected or
accepted, there is a much deeper problem. To make causal inferences, it
must in essence be assumed that equations are invariant under proposed
interventions. . . . if the coefficients and error terms change when the
variables on the right hand side of the equation are manipulated rather
than being passively observed, then the equation has only a limited utility
for predicting the results of interventions.”
[4] Most published methods also require that the residuals be normally distributed.