That is because we can control the rate of Type I error (i.e. falsely rejecting the null
hypothesis when it is true). Type I error rates are controlled by setting the alpha level of
the test and so statistical tests cannot be said to differ in their rates of Type I error. In con-
trast, they can differ in their rates of Type II error. The power of a statistical test is its ability
to distinguish between the false null hypothesis and the true alternative, and it is some-
times expressed as 1 minus the rate of Type II error.
Estimating the power of statistical tests turns out to be both difficult and neglected by
many researchers. Some work indicates that permutation, bootstrap and analytic tests
have equivalent statistical power when the data meet the requirements of the analytic tests
(Hoeffding, 1952; Robinson, 1973; Romano, 1989; Manly, 1997). Edgington (1995) reports
higher statistical power for randomization tests when there are violations of the assump-
tions of the analytic statistical tests. Efron and Tibshirani (1993) present an approach to
estimating power, given a specific sample size. The approach offered by Sheets and
Mitchell (2001) is to use Monte Carlo methods to estimate the rates of Type II error under
several plausible alternatives to the null hypothesis. Despite the attendant difficulty in
estimating the statistical power of different tests, randomization-based tests seem to have
at least as much statistical power as the more familiar analytical tests.
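As a concrete illustration of the Monte Carlo strategy just described, the sketch below estimates power for a two-sample comparison of means under one plausible alternative (a specified true difference between group means). It is a minimal sketch rather than the specific procedure of Sheets and Mitchell (2001): the effect size, group sizes and the use of an ordinary t-test as the inner test are assumptions chosen for illustration, and a permutation or bootstrap test could be substituted for it.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)

# Assumed alternative hypothesis (illustrative): two normal populations whose
# means differ by `true_diff`, measured in units of the common standard deviation.
n_per_group = 20       # specimens per group
true_diff = 0.8        # true difference in means under the alternative
alpha = 0.05           # Type I error rate fixed by the researcher
n_datasets = 2000      # Monte Carlo data sets simulated under the alternative

rejections = 0
for _ in range(n_datasets):
    x = rng.normal(0.0, 1.0, n_per_group)
    y = rng.normal(true_diff, 1.0, n_per_group)
    _, p = ttest_ind(x, y)   # any test (analytic or resampling) could be used here
    if p < alpha:
        rejections += 1

power = rejections / n_datasets   # estimated power = 1 - rate of Type II error
print(f"estimated power: {power:.2f}")
print(f"estimated rate of Type II error: {1 - power:.2f}")
```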
How Many Repetitions?
Regardless of the method used, the researcher is always faced with the question of how
many replications or repetitions should be made. We want estimates with small bias and standard
deviation, but it is not clear how many replications are required to achieve this end. The num-
ber of independent bootstrap samples that one may form out of N specimens is
(2N − 1)!/[N!(N − 1)!] (Efron and Tibshirani, 1993), which is over 90 000 for N = 10 specimens. In most
cases, even thousands of bootstrap replicates will not come close to exhausting all possible
bootstrap sets. Typically, a modest subset of all possible sets is adequate for most statisti-
cal questions. Estimates of standard errors can usually be produced using only 100 or
fewer bootstrap sets (Efron and Tibshirani, 1993), but reliable estimates of confidence inter-
vals may require using many more. It does not appear that there is complete consensus on
this issue (see Efron, 1992; Efron and Tibshirani, 1993; Jackson and Somers, 1989; Manly,
1997), but it does seem that more repetitions are necessary for estimating confidence intervals,
because we must estimate a specific percentile value, than for hypothesis testing
(see Manly, 1997) or for estimating standard errors (Efron and Tibshirani, 1993). If com-
puter time is not an issue, a range of 1000 to 2000 bootstrap replicates is recommended for esti-
mating a 95% confidence interval on a parameter (Efron, 1987; Efron and Tibshirani, 1993)
and, in light of the very fast computers now generally available, far more than these are
feasible. When the time necessary to complete a calculation is a factor, one approach is to
increase the number of repetitions steadily until arriving at a value that is stable with respect
to further increases. This stability criterion is perhaps most applicable to hypothesis
testing, where we may not need to know the exact significance level attained by the observed
statistic, only whether we can (or cannot) reject the null hypothesis at the 5% level.
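Before turning to the sequential approach illustrated in the next paragraph, it is easy to verify the combinatorial claim made earlier in this section. A minimal check (the choice of Python and of N = 10 is simply for illustration):

```python
from math import comb, factorial

N = 10  # number of specimens in the sample being bootstrapped

# Number of distinct bootstrap samples of size N drawn with replacement
# (order ignored) from N specimens: (2N - 1)! / [N! (N - 1)!] = C(2N - 1, N).
n_distinct = factorial(2 * N - 1) // (factorial(N) * factorial(N - 1))
print(n_distinct)                        # 92378, i.e. over 90 000 for N = 10
print(n_distinct == comb(2 * N - 1, N))  # True: the same binomial coefficient
```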
Using this sequential approach, we could, for example, run a bootstrap t-test and find
that in 100 bootstrap repetitions the difference in means exceeds the observed difference 40 times
(yielding p = 0.40). It is probably safe to state that we cannot reject the null hypothesis at the
5% significance level in light of that result. A repetition of the bootstrap procedure might yield a
somewhat different p-value, but not one likely to alter that conclusion.
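The numbers above are illustrative, but the procedure itself is easy to sketch. The code below implements one common form of a bootstrap test of a difference in means (resampling each group after centering both on the pooled mean, in the spirit of Efron and Tibshirani, 1993) and reports the p-value at increasing numbers of repetitions, which is one way to apply the stability criterion described earlier. The data values and the repetition counts are assumptions chosen for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative measurements for two groups (assumed values).
group1 = np.array([10.2, 11.1, 9.8, 10.6, 10.9, 11.4, 10.0, 10.7])
group2 = np.array([10.9, 11.6, 11.2, 10.8, 11.9, 11.3, 11.7, 10.5])

observed = abs(group1.mean() - group2.mean())

# Enforce the null hypothesis by centering both groups on the pooled mean,
# then resample each centered group with replacement.
pooled_mean = np.concatenate([group1, group2]).mean()
null1 = group1 - group1.mean() + pooled_mean
null2 = group2 - group2.mean() + pooled_mean

def bootstrap_p(n_reps):
    """Fraction of bootstrap replicates whose |difference in means| is at least the observed one."""
    count = 0
    for _ in range(n_reps):
        b1 = rng.choice(null1, size=null1.size, replace=True)
        b2 = rng.choice(null2, size=null2.size, replace=True)
        if abs(b1.mean() - b2.mean()) >= observed:
            count += 1
    return count / n_reps

# Stability criterion: increase the number of repetitions until the estimated
# p-value no longer changes enough to alter the decision at the 5% level.
for n_reps in (100, 500, 1000, 2000):
    print(f"{n_reps:5d} repetitions: p = {bootstrap_p(n_reps):.3f}")
```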