Statistical Methods - Text Mining Techniques for Healthcare Provider Quality Determination


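/* Draw a simple random sample of 50 records from each diabetes stratum. */
/* STRATA expects the input data set to be sorted by diabetes. */
PROC SURVEYSELECT DATA=nis.nis_with_diabetescode
    OUT=NIS.DIABETESSAMPLE50
    METHOD=SRS
    N=50
    NOPRINT;
  STRATA diabetes;
  /* Keep only the variables needed for the comparison */
  ID LOS TOTCHG diabetes;
RUN;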

The SAS code above can be modified for different sample sizes. Using it to generate a sample of size 200, the t-test remains non-significant, but the confidence intervals are considerably narrower, at (-1.298, 0.5078) for length of stay and (-9344, 2581) for total charges. At n=1000, the intervals narrow further to (-0.618, 0.018) and (-3542, 55.64). When n increases to 10,000, the p-values become highly statistically significant, with intervals (-0.579, -0.400) and (-4402, -3453). In other words, the effect size for length of stay is less than 0.15 of a day; the effect size for cost is approximately $500. If the sample size is increased further, the effect size will be smaller still. It is already so small that, while it has statistical significance, it has no real practical importance. In fact, if we use the complete data set, the confidence intervals shrink to (-0.443, -0.429) and (-3783, -3702), for a statistically significant difference of $80.
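The repeated sampling described above could be wrapped in a macro so that only the sample size changes between runs. The following is a minimal sketch, not the original analysis code: the macro name sample_ttest and the WORK data set names are hypothetical, and it assumes that diabetes has exactly two levels, as PROC TTEST requires for a CLASS variable.

```sas
/* Sort once by the stratum variable, as the STRATA statement requires. */
/* work.sorted is a hypothetical scratch data set.                      */
proc sort data=nis.nis_with_diabetescode out=work.sorted;
  by diabetes;
run;

/* Hypothetical macro: sample n records per stratum, then run t-tests. */
%macro sample_ttest(n);
  proc surveyselect data=work.sorted out=work.sample&n
      method=srs n=&n noprint;
    strata diabetes;
    id los totchg diabetes;
  run;

  /* Two-sample t-tests; the output includes 95% confidence limits */
  /* for the difference in means of each variable.                 */
  proc ttest data=work.sample&n;
    class diabetes;
    var los totchg;
  run;
%mend sample_ttest;

%sample_ttest(200);
%sample_ttest(1000);
%sample_ttest(10000);
```

Because the samples are random, the intervals from a rerun will differ from those quoted in the text unless a SEED= value is fixed in PROC SURVEYSELECT.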

The Central Limit Theorem and the Assumption of Normality

Regression assumes that the residuals are normally distributed. However, most healthcare data follow exponential or gamma distributions because of the presence of extreme outliers, and the mean of a distribution is highly sensitive to outliers. It is usually better to truncate outliers, to use nonparametric tests based upon the median, or to use a model that accommodates a skewed distribution. Even so, nonparametric tests still require symmetry in the distribution and also have difficulty with skewed populations.
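Two of the alternatives mentioned above can be sketched in SAS as follows. This is an illustration under assumptions rather than the analysis used in the text: it reuses the nis.nis_with_diabetescode data set and the diabetes, LOS and TOTCHG variables from the earlier code, and it assumes a gamma model with a log link is a reasonable choice for total charges.

```sas
/* Nonparametric comparison of length of stay between the two groups: */
/* Wilcoxon rank-sum test and the median test.                        */
proc npar1way data=nis.nis_with_diabetescode wilcoxon median;
  class diabetes;
  var los;
run;

/* A model that accepts a skewed outcome: gamma regression with a log */
/* link. The gamma distribution needs strictly positive responses, so */
/* records with zero or missing charges are excluded here.            */
proc genmod data=nis.nis_with_diabetescode;
  where totchg > 0;
  class diabetes;
  model totchg = diabetes / dist=gamma link=log;
run;
```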

Linear regression requires moderately large samples to be effective. Power analysis tends to assume that the population is sufficiently homogeneous to be normally distributed. Because healthcare outcomes tend to follow exponential or gamma distributions, since the populations in outcomes research are heterogeneous, we must consider just how large n has to be before the Central Limit Theorem is realistic (Battioui, 2007b). To examine the issue, we take samples of different sizes and compute the distribution of the sample mean. The following code computes 100 mean values for a sample size of 5; the sample size is then increased in steps up to 10,000.

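/* Draw 100 simple random samples (replicates), each of size 5 */
PROC SURVEYSELECT DATA=nis.nis_205 OUT=work.samples METHOD=SRS N=5 REP=100 NOPRINT;
RUN;

/* Compute the mean length of stay within each replicate */
PROC MEANS DATA=work.samples NOPRINT;
  BY replicate;
  VAR los;
  OUTPUT OUT=out MEAN=mean;
RUN;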
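One way to repeat this step for several sample sizes is a macro loop, followed by a look at the distribution of the 100 means. This is only a sketch: the macro name clt_means, the output data set names, and the intermediate sizes 30, 100 and 1000 are assumptions, since the text gives only the endpoints 5 and 10,000.

```sas
/* Hypothetical macro: for each sample size, draw 100 replicates, */
/* compute the mean LOS per replicate, and examine those means.   */
%macro clt_means(sizes=5 30 100 1000 10000);
  %local i n;
  %do i = 1 %to %sysfunc(countw(&sizes));
    %let n = %scan(&sizes, &i);

    proc surveyselect data=nis.nis_205 out=work.samples&n
        method=srs n=&n rep=100 noprint;
    run;

    proc means data=work.samples&n noprint;
      by replicate;
      var los;
      output out=work.means&n mean=mean;
    run;

    /* Normality tests and a histogram with a fitted normal curve */
    /* for the 100 sample means at this sample size.              */
    proc univariate data=work.means&n normal;
      var mean;
      histogram mean / normal;
    run;
  %end;
%mend clt_means;

%clt_means();
```

Comparing the histograms across sample sizes shows how large n must be before the sample means look approximately normal.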
