VERIFY THE DATA
The first step in any analysis is to verify that the data have been entered
correctly. As noted in Chapter 3, GIGO (garbage in, garbage out). A short
time ago, a junior biostatistician came into my office asking for help with
covariate adjustments for race. “The data for race doesn't make sense,” she
said. Indeed, the proportions of the various races did seem incorrect. No
“adjustment” could be made. Nor was there any reason to believe that race
was the only variable affected. The first and only solution was to do a
thorough examination of the database and, where necessary, trace the data
back to their origins until all the bad data had been replaced with good.
The SAS programmer's best analysis tool is PROC MEANS. By merely
examining the maximum and minimum values of all variables, it often is
possible to detect data that were entered in error. Some years ago, I
found that the minimum value of one essential variable was zero. I brought
this to the attention of a domain expert who told me that a zero was
impossible. As it turns out, the data were full of zeros, the explanation
being that the executive in charge had been faking results. Of the 150
subjects in the database, only 50 were real.
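The same minimum-and-maximum screen can be sketched outside SAS. The Python below is only an illustration of the idea, not a substitute for PROC MEANS: the function names, the variable names in the usage note, and the plausible limits are all hypothetical, and in practice the limits would come from a domain expert.

```python
def range_report(rows):
    """Return {variable: (minimum, maximum)} over all numeric values in rows."""
    report = {}
    for row in rows:
        for name, value in row.items():
            if value is None or isinstance(value, str):
                continue  # skip missing and non-numeric entries
            low, high = report.get(name, (value, value))
            report[name] = (min(low, value), max(high, value))
    return report


def out_of_range(report, plausible):
    """List variables whose observed extremes fall outside the plausible limits."""
    flagged = []
    for name, (low, high) in report.items():
        if name in plausible:
            lowest_ok, highest_ok = plausible[name]
            if low < lowest_ok or high > highest_ok:
                flagged.append(name)
    return flagged
```

For example, with records such as {"age": 34, "dose": 2.5} and {"age": 51, "dose": 0.0}, and an assumed plausible range of (0.1, 10.0) for dose, out_of_range would flag dose because its minimum is zero. A flag is a prompt to trace the values back to their source, not proof of an error.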
Before you begin any analysis, verify that the data have been entered
correctly.
COMPARING MEANS OF TWO POPULATIONS
The most common test for comparing the means of two populations is
based upon Student's t. For Student's t test to provide significance levels
that are exact rather than approximate, all the observations must be
independent and, under the null hypothesis, all the observations must come
from identical normal distributions.
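As a concrete reminder of what is being computed, here is a minimal sketch of the two-sample t statistic in its usual pooled (equal-variance) form, which matches the assumption of identical distributions under the null hypothesis. The function name is hypothetical, and the sketch returns only the statistic and its degrees of freedom; the significance level would still have to be read from a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(x, y):
    """Return (t, degrees_of_freedom) for the equal-variance two-sample t test."""
    n, m = len(x), len(y)
    df = n + m - 2
    # Pooled estimate of the common variance from the two sample variances.
    pooled_var = ((n - 1) * variance(x) + (m - 1) * variance(y)) / df
    standard_error = sqrt(pooled_var * (1 / n + 1 / m))
    return (mean(x) - mean(y)) / standard_error, df
```

Calling pooled_t_statistic on two lists of observations gives a statistic to compare with Student's t distribution on n + m - 2 degrees of freedom.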
Even if the distribution is not normal, the significance level of the t test
is almost exact for sample sizes greater than 12; for most of the
distributions one encounters in practice,[2] the significance level of the
t test is usually within a percent or so of the correct value for sample
sizes between 6 and 12.
There are more powerful tests than the t test for testing against non-
normal alternatives. For example, a permutation test replacing the original
observations with their normal scores is more powerful than the t test
(Lehmann and D'Abrera, 1988).
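A normal-scores permutation test can be sketched in a few lines for small samples. The code below is an illustration under stated assumptions, not the procedure as given by Lehmann and D'Abrera: it uses the van der Waerden approximation to the normal scores (the standard normal quantile at rank/(N + 1)), assumes there are no tied observations, computes a one-sided p-value, and enumerates every relabeling, which is practical only when the samples are small. The function names are hypothetical.

```python
from itertools import combinations
from statistics import NormalDist


def normal_scores(values):
    """Replace each value by the normal quantile at rank / (N + 1); assumes no ties."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    scores = [0.0] * n
    for rank, idx in enumerate(order, start=1):
        scores[idx] = NormalDist().inv_cdf(rank / (n + 1))
    return scores


def permutation_p_value(x, y):
    """One-sided exact p-value: the fraction of relabelings whose first-group
    score sum is at least as large as the observed sum."""
    scores = normal_scores(list(x) + list(y))
    n = len(x)
    observed = sum(scores[:n])
    count = total = 0
    # Every way of choosing which n of the pooled observations form group one.
    for group in combinations(range(len(scores)), n):
        total += 1
        if sum(scores[i] for i in group) >= observed - 1e-12:
            count += 1
    return count / total
```

With three observations per group there are only 20 possible relabelings, so the smallest attainable one-sided p-value is 1/20; this is obtained when every observation in the first sample exceeds every observation in the second.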
Permutation tests are derived by looking at the distribution of values
the test statistic would take for each of the possible assignments of treat-
ments to subjects. For example, if in an experiment two treatments were
[2] Here and throughout this text, we deliberately ignore the many exceptional cases (to the
delight of the true mathematician) that one is unlikely to encounter in the real world.