the testing of multiple hypotheses, a process that typifies the method of
stepwise regression, can only exacerbate the effects of spurious correlation.
As he notes in the introduction to the article, “If the number of variables
is comparable to the number of data points, and if the variables are only
imperfectly correlated among themselves, then a very modest search procedure
will produce an equation with a relatively small number of explanatory
variables, most of which come in with significant coefficients, and a
highly significant R². This will be so even if Y is totally unrelated to the
X's.”
Freedman used computer simulation to generate 5100 independent normally
distributed “observations.” He put these values into a data matrix in
the form required by the SAS regression procedure. His organization of
the values defined 100 “observations” on each of 51 random variables.
Arbitrarily, the first 50 variables were designated as “explanatory” and the
51st as the dependent variable Y.
In the first of two passes through the “data,” all 50 of the explanatory
variables were used. Fifteen of the 50 coefficients were significant at the
25% level, and one was significant at the 5% level.
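As a rough benchmark for these counts: when Y is unrelated to every X and the errors are normal, each coefficient's t-test rejects at level α with probability α, so the expected number of “significant” coefficients among 50 is 50α. The arithmetic below is our own illustration, not a calculation quoted from Freedman's article.

```latex
\mathbb{E}\bigl[\#\{\text{coefficients significant at level } \alpha\}\bigr] = 50\,\alpha,
\qquad 50 \times 0.25 = 12.5,
\qquad 50 \times 0.05 = 2.5 .
```

The first-pass counts of 15 and 1 are therefore about what chance alone would be expected to produce.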
On the second pass, a model was constructed using only the 15
“explanatory” variables that had proved significant on the first pass.
The resulting model had an R² of 0.36, and the model coefficients of six
of the “explanatory” (but completely unrelated) variables were significant
at the 5% level. Given these findings, how can we be sure whether the
statistically significant variables we uncover in our own research via
regression methods are truly explanatory or are merely the result of
chance?
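The two-pass experiment described above can be imitated in a few lines. What follows is a minimal sketch in plain Python, not Freedman's SAS program. The function names, the seed, the inclusion of an intercept, and the numerical integration used for the t-distribution p-values are all our own assumptions, and the counts it prints will vary from seed to seed rather than reproduce Freedman's figures.

```python
import math
import random


def t_two_sided_p(t, df, steps=1000):
    """Two-sided p-value for Student's t, via Simpson's rule on the density."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))

    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(a)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    return max(0.0, 1.0 - 2.0 * total * h / 3.0)


def invert(a):
    """Gauss-Jordan inverse with partial pivoting (adequate for a small X'X)."""
    n = len(a)
    m = [row[:] + [float(i == j) for j in range(n)] for i, row in enumerate(a)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        piv = m[c][c]
        m[c] = [v / piv for v in m[c]]
        for r in range(n):
            if r != c and m[r][c] != 0.0:
                f = m[r][c]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[c])]
    return [row[n:] for row in m]


def ols_p_values(x_cols, y):
    """Regress y on an intercept plus x_cols; return (R^2, slope p-values)."""
    n = len(y)
    cols = [[1.0] * n] + x_cols  # assumption: the model includes an intercept
    k = len(cols)
    xtx = [[sum(a * b for a, b in zip(ci, cj)) for cj in cols] for ci in cols]
    xty = [sum(a * b for a, b in zip(ci, y)) for ci in cols]
    inv = invert(xtx)
    beta = [sum(inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    fitted = [sum(beta[j] * cols[j][i] for j in range(k)) for i in range(n)]
    rss = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ybar = sum(y) / n
    tss = sum((yi - ybar) ** 2 for yi in y)
    df = n - k
    s2 = rss / df  # residual variance estimate
    p_vals = [t_two_sided_p(beta[j] / math.sqrt(s2 * inv[j][j]), df)
              for j in range(1, k)]
    return 1.0 - rss / tss, p_vals


def screening_experiment(n=100, p=50, seed=0):
    """Pure-noise data: fit all p variables, then refit the 25%-level survivors."""
    rng = random.Random(seed)
    x = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(p)]
    y = [rng.gauss(0, 1) for _ in range(n)]  # Y is unrelated to every X
    r2_first, pv1 = ols_p_values(x, y)
    keep = [j for j, v in enumerate(pv1) if v < 0.25]
    r2_second, pv2 = ols_p_values([x[j] for j in keep], y)
    return {
        "first_r2": r2_first,
        "first_sig_25": len(keep),
        "first_sig_05": sum(v < 0.05 for v in pv1),
        "second_r2": r2_second,
        "second_sig_05": sum(v < 0.05 for v in pv2),
    }


if __name__ == "__main__":
    print(screening_experiment())
```

Running `screening_experiment` with several different seeds gives a feel for how often the second pass yields a handful of coefficients significant at the 5% level, even though every column is noise by construction.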
A partial answer may be found in an article by Gail Gong published in
1986 and reproduced in its entirety in Appendix 2.
Gail Gong was among the first students, if not the first, to make the
bootstrap the basis of her doctoral dissertation. Reading her article,
reprinted here with the permission of the American Statistical Association,
we learn the bootstrap can be an invaluable tool for model validation, a
result we explore at greater length in the following chapter. We also learn
not to take for granted the results of a stepwise regression.
Gong [1986] constructed a logistic regression model based on
observations Peter Gregory made on 155 chronic hepatitis patients, 33
of whom died. The object of the model was to identify patients at high
risk. In contrast to the computer simulations David Freedman performed,
the 19 explanatory variables were real, not simulated, derived from
medical histories, physical examinations, X-rays, liver function tests, and
biopsies.
If one or more extreme values can influence the slope and intercept
of a univariate regression line, think how much more impact, and how