v(z) ≡ P{|Z| > z} = 2[1 − Φ(z)]   (9)

and

w(z) ≡ E{Z²; |Z| > z} = 2[1 − Φ(z)] + (2/π)^{1/2} z exp(−z²/2),

where Z is standard normal with distribution function Φ. Then λ is defined by v(λ) = α, and g(λ) = w(λ).
In particular, g is continuous. Intuition suggests that g be positive. This
fact will not be needed here, but it is true: see Diaconis and Freedman
(1982, (3.15)-(3.16)).
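As a quick numerical check of these closed forms (a sketch in Python; the names v and w follow the notation of (9), and the two-sided 25 percent point λ ≈ 1.15 anticipates the example below):

```python
from math import erf, exp, pi, sqrt

def Phi(z):
    # Standard normal distribution function, via the error function.
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))

def v(z):
    # v(z) = P{|Z| > z} = 2[1 - Phi(z)]
    return 2.0 * (1.0 - Phi(z))

def w(z):
    # w(z) = E{Z^2; |Z| > z} = 2[1 - Phi(z)] + sqrt(2/pi) z exp(-z^2/2)
    return v(z) + sqrt(2.0 / pi) * z * exp(-z * z / 2.0)

lam = 1.1503           # two-sided 25 percent point: v(lam) = 0.25
alpha = v(lam)
g = w(lam)
print(alpha, g, g / alpha)   # approximately 0.25, 0.72, 2.9
```

The ratio g(λ)/α is the conditional second moment E{Z² | |Z| > λ}, which reappears in the example at the end of this section.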
Proposition. Assume (1) and (2). In probability: q_{n,α}/n → αρ, R²_{n,α} → g(λ)ρ, and

F_{n,α} → [g(λ)/α] · (1 − αρ)/(1 − g(λ)ρ).   (10)
In the second regression, the t statistic for testing whether a coefficient
vanishes is asymptotically distributed as
[(1 − αρ)/(1 − g(λ)ρ)]^{1/2} · Z_λ,

where Z_λ is distributed as Z conditional on |Z| > λ.
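This limiting law makes the distortion of the t test easy to quantify. A sketch in Python, using the values ρ = 1/2 and α = 0.25 from the example below; the 1.96 cutoff for a nominal 5 percent two-sided test is an added illustration, not part of the original text:

```python
from math import erf, exp, pi, sqrt

Phi = lambda z: 0.5 * (1.0 + erf(z / sqrt(2.0)))
tail = lambda z: 2.0 * (1.0 - Phi(z))            # P{|Z| > z}

rho, alpha, lam = 0.5, 0.25, 1.1503
# g(lam) = 2[1 - Phi(lam)] + sqrt(2/pi) lam exp(-lam^2/2), as in (9)
g = tail(lam) + sqrt(2.0 / pi) * lam * exp(-lam ** 2 / 2.0)

# The t statistic is asymptotically scale * Z_lam.
scale = sqrt((1.0 - alpha * rho) / (1.0 - g * rho))
# Real level of a nominal 5 percent test:
# P{|scale * Z_lam| > 1.96} = P{|Z| > 1.96/scale} / P{|Z| > lam}.
level = tail(1.96 / scale) / tail(lam)
print(round(scale, 2), round(level, 2))          # approximately 1.17 and 0.38
```

So a t test run at a nominal 5 percent level on a variable that survived screening rejects, asymptotically, nearly 40 percent of the time under the null.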
These results may be interpreted as follows. The number of variables
in the first-pass regression is p = ρn + o(n); the number in the second
pass is q_{n,α} = αρn + o(n). That is, as may be expected, a fraction α of the
variables are significant at level α. Since g(λ) < 1, the R² in the second-pass
regression is essentially the fraction g(λ) of the R² in the first pass. Likewise,
g(λ) > α, so the asymptotic value of the F statistic exceeds 1. Since the
number of degrees of freedom is growing, off-scale P-values will result.
Finally, the real level of the t test may differ appreciably from the nominal
level.
Example. Suppose N = 100 and p = 50, so ρ = 1/2; and α = 0.25, so λ ≈
1.15. Then g(λ) ≈ 0.72 and E{Z² | |Z| > λ} ≈ 2.9. In a regression with 50
explanatory variables and 100 data points, on the null hypothesis R²
should be nearly 1/2.
Next, run the regression again, keeping only the variables significant at
the 25 percent level. The new R² should be around g(λ) = 72 percent of
the original R². The new F statistic should be around 4, by (10).