simulation (Section 2) and through asymptotic calculation (Section 3). For
another discussion, see Rencher and Pun (1980).
To help draw the conclusion explicitly, suppose an investigator seeks to
predict a variable Y in terms of some large and indefinite list of explanatory
variables X₁, X₂, .... If the number of variables is comparable to the
number of data points, and if the variables are only imperfectly correlated
among themselves, then a very modest search procedure will produce an
equation with a relatively small number of explanatory variables, most of
which come in with significant coefficients, and a highly significant R². This
will be so even if Y is totally unrelated to the X's.
To sum up, in a world with a large number of unrelated variables and
no clear a priori specifications, uncritical use of standard methods will lead
to models that appear to have a lot of explanatory power. That is the
main—and negative—message of the present note. Therefore, only the
null hypothesis is considered here, and only the case where the number of
variables is of the same order as the number of data points.
The present note is in the same spirit as the pretest literature. An early
reference is Olshen (1973). However, there is a real difference in
implementation: Olshen conditions on an F test being significant; the present
note screens out the insignificant variables and refits the equation. Thus,
Olshen has only one equation to deal with; the present note has two. The
results of this note can also be differentiated from the theory of pretest
estimators described in, for example, Judge and Bock (1978). To use the
latter estimators, the investigator must decide a priori which coefficients
may be set to zero; here, this decision is made on the basis of the data.
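The screen-and-refit idea described above can be sketched in Python. This is only an illustration under stated assumptions: NumPy and SciPy are used, an intercept is fitted, the helper names `ols_pvalues` and `screen_and_refit` are hypothetical, and the screening level `alpha` is an illustrative parameter rather than a value taken from the text.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """OLS of y on X with an intercept (an assumption of this sketch).
    Returns R^2 and the two-sided t-test P-values of the slopes."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # add intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    df_resid = n - p - 1
    sigma2 = resid @ resid / df_resid             # residual variance
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(A.T @ A)))
    pvals = 2.0 * stats.t.sf(np.abs(beta / se), df_resid)[1:]
    r2 = 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)
    return r2, pvals

def screen_and_refit(X, y, alpha=0.25):
    """Fit once, keep the columns whose P-value is below alpha, refit.
    alpha is an illustrative screening level, not the author's."""
    r2_first, pvals = ols_pvalues(X, y)
    keep = pvals < alpha                          # data-driven choice
    if not keep.any():                            # nothing survives
        return r2_first, keep, None
    r2_second, _ = ols_pvalues(X[:, keep], y)
    return r2_first, keep, r2_second

# Pure-noise data; the seed is hypothetical.
rng = np.random.default_rng(0)
data = rng.standard_normal((100, 51))
X, y = data[:, :50], data[:, 50]
r2_first, keep, r2_second = screen_and_refit(X, y)
print("variables kept:", int(keep.sum()))
```

The point of the sketch is that the set of retained variables is chosen by the data, not fixed in advance, which is the contrast drawn with the pretest estimators.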
2. A SIMULATION
A matrix was created with 100 rows (data points) and 51 columns (variables).
All the entries in this matrix were independent observations drawn
from the standard normal distribution. In short, this matrix was pure
noise. The 51st column was taken as the dependent variable Y in a regression
equation; the first 50 columns were taken as the independent variables
X₁, ..., X₅₀. By construction, then, Y was independent of the X's.
Ideally, R² should have been insignificant, by the standard F test. Likewise,
the regression coefficients should have been insignificant, by the standard
t test.
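The construction of the data matrix can be sketched in a few lines of Python with NumPy. The seed, the generator, and the variable names are assumptions made for illustration; the text does not say how the random numbers were produced, so these draws will not reproduce the original ones.

```python
import numpy as np

# Hypothetical seed, chosen only to make this sketch repeatable.
rng = np.random.default_rng(0)

n_rows, n_cols = 100, 51                       # 100 data points, 51 variables
data = rng.standard_normal((n_rows, n_cols))   # independent N(0, 1) entries

X = data[:, :50]   # columns 1 to 50: explanatory variables
y = data[:, 50]    # column 51: dependent variable Y

print(X.shape, y.shape)
```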
These data were analyzed in two successive multiple regressions. In the
first pass, Y was run on all 50 of the X's, with the following results:
R² = 0.50, P = 0.53;
15 coefficients out of 50 were significant at the 25 percent level;
1 coefficient out of 50 was significant at the 5 percent level.
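A first pass of this kind might look as follows in Python. This is a minimal sketch, assuming NumPy and SciPy, an intercept in the equation (the text does not say whether one was fitted), a hypothetical seed, and a hypothetical helper name `first_pass`. Because the random draws differ from the original ones, the printed figures will not match those reported above; only the general pattern should be similar.

```python
import numpy as np
from scipy import stats

def first_pass(X, y):
    """OLS of y on all columns of X with an intercept (an assumption).
    Returns R^2, the P-value of the overall F test, and the two-sided
    t-test P-values of the slope coefficients."""
    n, p = X.shape
    A = np.column_stack([np.ones(n), X])          # design matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    df_resid = n - p - 1
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    r2 = 1.0 - rss / tss
    # Overall F test of "all slopes are zero".
    f_stat = (r2 / p) / ((1.0 - r2) / df_resid)
    f_pvalue = stats.f.sf(f_stat, p, df_resid)
    # Standard errors and t tests for the individual slopes.
    cov = (rss / df_resid) * np.linalg.inv(A.T @ A)
    t_stats = beta[1:] / np.sqrt(np.diag(cov)[1:])
    t_pvalues = 2.0 * stats.t.sf(np.abs(t_stats), df_resid)
    return r2, f_pvalue, t_pvalues

rng = np.random.default_rng(0)                    # hypothetical seed
data = rng.standard_normal((100, 51))
X, y = data[:, :50], data[:, 50]

r2, f_pvalue, t_pvalues = first_pass(X, y)
print(f"R^2 = {r2:.2f}, P = {f_pvalue:.2f}")
print("significant at the 25 percent level:", int(np.sum(t_pvalues < 0.25)))
print("significant at the 5 percent level: ", int(np.sum(t_pvalues < 0.05)))
```

For orientation: with pure noise, 50 regressors, 100 rows, and an intercept, the expected value of R² is 50/99, or about 0.51, and on average about 12.5 coefficients would be significant at the 25 percent level and about 2.5 at the 5 percent level. The reported figures of 0.50, 15, and 1 are in line with that.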