(a) The predictor variables to be employed are specified beforehand
(that is, we do not use the information in the sample to select
them).
(b) The coefficient estimates obtained from a calibration sample
drawn from a certain population are to be applied to other
members of the same population.
The proportion to be set aside for validation purposes will depend upon
the loss function. If both the goodness-of-fit error in the calibration
sample and the prediction error in the validation sample are based on
mean-squared error, Picard and Berk [1990] report that we can minimize
their sum by reserving between one-fourth and one-third of the sample
for validation.
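To make the arithmetic concrete, here is a minimal Python sketch of such a calibration/validation split, reserving one-third of the sample for validation as Picard and Berk suggest. The simulated data and the least-squares fit are illustrative assumptions, not part of the original discussion.

```python
# Minimal sketch: hold out one-third of the sample for validation.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 120 observations, 3 predictors (an assumption).
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=120)

idx = rng.permutation(len(y))
n_val = len(y) // 3                      # one-third set aside for validation
val, cal = idx[:n_val], idx[n_val:]

# Fit on the calibration sample only.
beta, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)

mse_fit = np.mean((y[cal] - X[cal] @ beta) ** 2)   # goodness-of-fit error
mse_pred = np.mean((y[val] - X[val] @ beta) ** 2)  # prediction error
```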
A compromise proposed by Mosier [1951] is worth revisiting: The orig-
inal sample is split in half; regression variables and coefficients are selected
independently for each of the subsamples; if they are more or less in
agreement, then the two samples should be combined and the coefficients
recalculated with greater precision.
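A rough sketch of Mosier's procedure might look as follows. The agreement test, a crude tolerance on the coefficient estimates, is our own assumption, since the text leaves "more or less in agreement" to judgment; the data are simulated for illustration.

```python
# Sketch of Mosier's split-half check (data and tolerance are illustrative).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

idx = rng.permutation(len(y))
half1, half2 = idx[:100], idx[100:]

# Fit the regression independently on each half.
b1, *_ = np.linalg.lstsq(X[half1], y[half1], rcond=None)
b2, *_ = np.linalg.lstsq(X[half2], y[half2], rcond=None)

# If the two estimates are more or less in agreement, combine the halves
# and recalculate the coefficients with greater precision.
if np.allclose(b1, b2, atol=0.5):        # tolerance is an arbitrary choice
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```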
A further proposal by Subrahmanyam [1972] to use weighted averages
where there are differences strikes us as equivalent to painting over cracks
left by the last earthquake. Such differences are a signal to probe deeper,
to look into causal mechanisms, and to isolate influential observations that
may, for reasons that need to be explored, be marching to a different
drummer.
Resampling
We saw in the report of Gail Gong [1986], reproduced in Appendix B, that
resampling methods such as the bootstrap may be used to validate our
choice of variables to include in the model. As we saw in the last chapter,
they may also be used to estimate the precision of our estimates.
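As one concrete illustration of the second use, a bootstrap estimate of the precision of regression coefficients can be sketched as follows; the simulated data and the choice of 1,000 replicates are assumptions made for the example.

```python
# Bootstrap sketch: resample rows with replacement, refit, summarize spread.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

B = 1000                                  # number of bootstrap replicates
boot = np.empty((B, X.shape[1]))
for b in range(B):
    i = rng.integers(0, len(y), size=len(y))   # sample rows with replacement
    boot[b], *_ = np.linalg.lstsq(X[i], y[i], rcond=None)

se = boot.std(axis=0, ddof=1)            # bootstrap standard errors
```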
But if we are to extrapolate successfully from our original sample to the
population at large, then our original sample must bear a strong resem-
blance to that population. When only a single predictor variable is involved,
a sample of 25 to 100 observations may suffice. But when we work with n
variables simultaneously, sample sizes on the order of 25^n to 100^n may be
required to adequately represent the full n-dimensional region. (With n = 4
predictors, for example, the lower end of this range is already 25^4, or
roughly 400,000 observations.)
Because of dependencies among the predictors, we can probably get by
with several orders of magnitude fewer data points. But the fact remains
that the sample size required for confidence in our validated predictions
grows exponentially with the number of variables.
Five resampling techniques are in general use:
1. K-fold, in which we subdivide the data into K roughly equal-sized
parts, then repeat the modeling process K times, leaving one
section out each time for validation purposes (a sketch follows).
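A minimal sketch of K-fold validation, with K = 5 and the simulated data chosen purely for illustration:

```python
# K-fold sketch: each of the K sections is left out once for validation.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=150)

K = 5
folds = np.array_split(rng.permutation(len(y)), K)

errors = []
for k in range(K):
    val = folds[k]                        # section held out this round
    cal = np.concatenate([folds[j] for j in range(K) if j != k])
    beta, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
    errors.append(np.mean((y[val] - X[val] @ beta) ** 2))

cv_error = float(np.mean(errors))        # prediction error averaged over folds
```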