/* The input data set must be sorted by diabetes for the STRATA statement. */
PROC SURVEYSELECT DATA=nis.nis_with_diabetescode
OUT=NIS.DIABETESSAMPLE50
METHOD=SRS
N=50
NOPRINT
;
STRATA diabetes;
ID LOS TOTCHG diabetes;
RUN;
We can modify the above SAS code for different sample sizes. We first use it to generate a sample
of size 200. The t-test remains nonsignificant, but the confidence intervals are considerably narrower, at
(-1.298, 0.5078) for length of stay and (-9344, 2581) for cost. At n=1000, the intervals narrow further to
(-0.618, 0.018) and (-3542, 55.64). When n increases to 10,000, the p-values become highly statistically
significant, with intervals of (-0.579, -0.400) and (-4402, -3453). In other words, the effect size for length
of stay is less than 0.15 of a day; the effect size for cost is approximately $500. If the sample size is
increased further, the effect size will be smaller still. It is already so small that, while it has statistical
significance, it has no real practical importance. In fact, if we use the complete data set, the confidence
intervals shrink to (-0.443, -0.429) and (-3783, -3702) for a statistically significant difference of $80.
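The repeated runs described above can be wrapped in a small macro. This is only a sketch under stated assumptions: the macro name sample_ttest and the WORK data set names are hypothetical, diabetes is assumed to be a two-level indicator, and the comparison is assumed to be a two-sample PROC TTEST, since the text does not show the analysis step itself.

```sas
/* STRATA requires input sorted by the stratum variable. */
proc sort data=nis.nis_with_diabetescode out=work.nis_sorted;
  by diabetes;
run;

/* Hypothetical helper: draw n records per diabetes stratum, then compare groups. */
%macro sample_ttest(n);
  proc surveyselect data=work.nis_sorted
      out=work.diabetessample&n
      method=srs
      n=&n
      noprint;
    strata diabetes;   /* stratum variable is kept in the output automatically */
    id los totchg;
  run;

  /* Assumed analysis step: two-sample t-test, 95% confidence limits by default. */
  proc ttest data=work.diabetessample&n;
    class diabetes;
    var los totchg;
  run;
%mend sample_ttest;

%sample_ttest(200)
%sample_ttest(1000)
%sample_ttest(10000)
```

Because the standard error of a difference in means falls roughly in proportion to the square root of n, each tenfold increase in sample size should narrow the intervals by about a factor of three, whatever the size of the underlying difference.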
The Central Limit Theorem and the Assumption of Normality
Regression requires the assumption that the residuals are normally distributed. However, most healthcare
data follow exponential or gamma distributions because of the presence of extreme outliers. The mean of a
distribution is highly susceptible to outliers. Usually, it is better to truncate outliers, to use nonparametric
tests based upon the median, or to use a model that accepts a skewed distribution. However, nonparametric
tests still require symmetry in the distribution and also have difficulty with skewed populations.
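Two of the alternatives named above might look like the following. This is a sketch, not the author's method: it assumes the stratified sample NIS.DIABETESSAMPLE50 created earlier, uses PROC NPAR1WAY for rank-based and median tests, and takes a gamma model with a log link in PROC GENMOD as one possible model that accepts a skewed distribution.

```sas
/* Nonparametric comparison: Wilcoxon rank-sum and median tests. */
proc npar1way data=nis.diabetessample50 wilcoxon median;
  class diabetes;
  var los totchg;
run;

/* Skewed-outcome model: gamma distribution with log link. */
/* The gamma distribution needs strictly positive responses. */
proc genmod data=nis.diabetessample50;
  where totchg > 0;
  class diabetes;
  model totchg = diabetes / dist=gamma link=log;
run;
```

With a log link, the diabetes coefficient is read on the multiplicative scale: its exponential estimates the ratio of mean charges between the two groups.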
Linear regression requires moderately large samples to be effective. Power analysis tends to assume
that the population distribution is sufficiently homogeneous to be normally distributed. As healthcare
outcomes tend to follow exponential or gamma distributions because the populations in outcomes research
are heterogeneous, we must consider just how large n has to be before the Central Limit Theorem is
realistic (Battioui, 2007b). To examine the issue, we take samples of different sizes to compute the
distribution of the sample mean. The following code computes 100 mean values for a sample size of 5;
the value of N is then increased in steps up to 10,000.
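/* Draw 100 independent simple random samples (replicates) of size 5. */
PROC SURVEYSELECT DATA=nis.nis_205 OUT=work.samples METHOD=SRS N=5 REP=100 NOPRINT;
RUN;
/* Compute the mean length of stay within each replicate. */
PROC MEANS DATA=work.samples NOPRINT;
BY replicate;
VAR los;
OUTPUT OUT=out MEAN=mean;
RUN;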
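To repeat this step over a range of sample sizes and inspect the shape of each set of means, a macro loop along the following lines could be used. It is a sketch under assumptions: the macro name mean_dist and the intermediate sizes 30, 100, and 1000 are hypothetical (the text gives only the endpoints 5 and 10,000), and PROC UNIVARIATE with the NORMAL option is one possible way to check normality, not necessarily the one used in the original study.

```sas
/* Hypothetical helper: for each sample size, compute 100 sample means of LOS */
/* and examine how close their distribution is to normal.                     */
%macro mean_dist(sizes=5 30 100 1000 10000);
  %local i n;
  %let i = 1;
  %let n = %scan(&sizes, &i);
  %do %while (%length(&n) > 0);

    /* 100 replicate simple random samples of size n */
    proc surveyselect data=nis.nis_205 out=work.samples
        method=srs n=&n rep=100 noprint;
    run;

    /* one mean per replicate */
    proc means data=work.samples noprint;
      by replicate;
      var los;
      output out=work.means&n mean=mean;
    run;

    /* normality tests and histogram of the 100 means */
    title "Distribution of 100 sample means of LOS, n=&n";
    proc univariate data=work.means&n normal;
      var mean;
      histogram mean / normal;
    run;

    %let i = %eval(&i + 1);
    %let n = %scan(&sizes, &i);
  %end;
  title;
%mend mean_dist;

%mean_dist()
```

Comparing the histograms and normality tests across sizes gives a direct view of how large n must be before the sample mean of a skewed outcome behaves approximately normally.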