(a) The predictor variables to be employed are specified beforehand
(that is, we do not use the information in the sample to select
them).
(b) The coefficient estimates obtained from a calibration sample
drawn from a certain population are to be applied to other
members of the same population.
The proportion to be set aside for validation purposes will depend upon
the loss function. If both the goodness-of-fit error in the calibration
sample and the prediction error in the validation sample are based on
mean-squared error, Picard and Berk [1990] report that we can minimize
their sum by reserving between one-fourth and one-third of the sample
for validation.
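To make the arithmetic concrete, here is a minimal Python sketch of such a calibration/validation split, reserving one-third of the sample for validation as Picard and Berk suggest. The simulated data and the least-squares fit are illustrative assumptions, not part of the original discussion.

```python
# Minimal sketch: hold out one-third of the sample for validation.
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 120 observations, 3 predictors (an assumption).
X = rng.normal(size=(120, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=120)

idx = rng.permutation(len(y))
n_val = len(y) // 3                      # one-third set aside for validation
val, cal = idx[:n_val], idx[n_val:]

# Fit on the calibration sample only.
beta, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)

mse_fit = np.mean((y[cal] - X[cal] @ beta) ** 2)   # goodness-of-fit error
mse_pred = np.mean((y[val] - X[val] @ beta) ** 2)  # prediction error
```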
A compromise proposed by Mosier [1951] is worth revisiting: The orig-
inal sample is split in half; regression variables and coefficients are selected
independently for each of the subsamples; if they are more or less in
agreement, then the two samples should be combined and the coefficients
recalculated with greater precision.
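A rough sketch of Mosier's procedure might look as follows. The agreement test, a crude tolerance on the coefficient estimates, is our own assumption, since the text leaves "more or less in agreement" to judgment; the data are simulated for illustration.

```python
# Sketch of Mosier's split-half check (data and tolerance are illustrative).
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=200)

idx = rng.permutation(len(y))
half1, half2 = idx[:100], idx[100:]

# Fit the regression independently on each half.
b1, *_ = np.linalg.lstsq(X[half1], y[half1], rcond=None)
b2, *_ = np.linalg.lstsq(X[half2], y[half2], rcond=None)

# If the two estimates are more or less in agreement, combine the halves
# and recalculate the coefficients with greater precision.
if np.allclose(b1, b2, atol=0.5):        # tolerance is an arbitrary choice
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
```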
A further proposal by Subrahmanyam [1972] to use weighted averages
where there are differences strikes us as equivalent to painting over cracks
left by the last earthquake. Such differences are a signal to probe deeper,
to look into causal mechanisms, and to isolate influential observations that
may, for reasons that need to be explored, be marching to a different
drummer.
Resampling
We saw in the report of Gail Gong [1986], reproduced in Appendix B, that
resampling methods such as the bootstrap may be used to validate our
choice of variables to include in the model. As we saw in the last chapter,
they may also be used to estimate the precision of our estimates.
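As one concrete illustration of the second use, a bootstrap estimate of the precision of regression coefficients can be sketched as follows; the simulated data and the choice of 1,000 replicates are assumptions made for the example.

```python
# Bootstrap sketch: resample rows with replacement, refit, summarize spread.
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=100)

B = 1000                                  # number of bootstrap replicates
boot = np.empty((B, X.shape[1]))
for b in range(B):
    i = rng.integers(0, len(y), size=len(y))   # sample rows with replacement
    boot[b], *_ = np.linalg.lstsq(X[i], y[i], rcond=None)

se = boot.std(axis=0, ddof=1)            # bootstrap standard errors
```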
But if we are to extrapolate successfully from our original sample to the
population at large, then our original sample must bear a strong resem-
blance to that population. When only a single predictor variable is involved,
a sample of 25 to 100 observations may suffice. But when we work with n
variables simultaneously, sample sizes on the order of 25^n to 100^n may be
required to adequately represent the full n-dimensional region. (With n = 4
predictors, for example, the lower end of this range is already 25^4, or
roughly 400,000 observations.)
Because of dependencies among the predictors, we can probably get by
with several orders of magnitude fewer data points. But the fact remains
that the sample size required for confidence in our validated predictions
grows exponentially with the number of variables.
Five resampling techniques are in general use:
1. K-fold, in which we subdivide the data into K roughly equal-sized
parts, then repeat the modeling process K times, leaving one
section out each time for validation purposes (a sketch follows).
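A minimal sketch of K-fold validation, with K = 5 and the simulated data chosen purely for illustration:

```python
# K-fold sketch: each of the K sections is left out once for validation.
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(150, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=150)

K = 5
folds = np.array_split(rng.permutation(len(y)), K)

errors = []
for k in range(K):
    val = folds[k]                        # section held out this round
    cal = np.concatenate([folds[j] for j in range(K) if j != k])
    beta, *_ = np.linalg.lstsq(X[cal], y[cal], rcond=None)
    errors.append(np.mean((y[val] - X[val] @ beta) ** 2))

cv_error = float(np.mean(errors))        # prediction error averaged over folds
```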