because none of the preprocessing steps are clearly identified. The programming code necessary for the
extraction is also not provided. Consider a recent study on heart failure. The total number of subjects
was 278,214. The number was reduced to 70,571 subjects for a logistic regression to test the relation-
ship between length of stay and treatment. With such a large sample size, the independent variables will
be statistically significant even when the effect size is almost zero. Other studies with large sample
sizes have similar problems, reporting all independent variables as statistically significant (Delaney,
Chang, Senagore, & Broder, 2008).
In a study of length of stay for the treatment of lung cancer, the sample size was 4,979, but the treatment
under consideration was performed in only 351 patients (7%), making it a rare occurrence (Wright et
al., 2008). The study did not adjust for this rarity, nor did it report the difference between the false
positive and the false negative rate. It is therefore doubtful that the study has any real predictive capa-
bility. Any model that predicts non-occurrence for every patient will be 93% accurate, so a good prediction
model would have to be more accurate than 93%. Another aspect of predictive modeling is that multiple
models are used and compared, defining a holdout sample (or minimizing misclassification costs) to find
the optimal choice. In traditional statistics, one model is chosen and used without any attempt to validate
the model choice or to compare it to other models (Odueyungbo, Browne, Akhtar-Danesh, & Thabane, 2008).
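The point about rare occurrences can be made concrete with a small sketch. Using hypothetical counts mirroring the lung-cancer study (4,979 subjects, 351 treated), a "model" that predicts non-occurrence for everyone scores roughly 93% accuracy on a holdout sample, which is why raw accuracy is a poor yardstick here:

```python
import random

random.seed(0)

# Hypothetical labels mirroring the lung-cancer study: 4,979 subjects,
# 351 of whom (about 7%) received the rare treatment.
n_total, n_positive = 4979, 351
labels = [1] * n_positive + [0] * (n_total - n_positive)
random.shuffle(labels)

# Holdout split: 70% would be used to fit a model, 30% to evaluate it.
split = int(0.7 * n_total)
train, holdout = labels[:split], labels[split:]

# Baseline "model" that predicts non-occurrence for every patient.
predictions = [0] * len(holdout)
accuracy = sum(p == y for p, y in zip(predictions, holdout)) / len(holdout)
print(f"All-negative baseline accuracy: {accuracy:.1%}")  # roughly 93%
```

Any candidate model must beat this trivial baseline on the holdout sample before its accuracy figure means anything.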
The Society of Thoracic Surgeons has an excellent database repository; however, the statistical
methods used need to include predictive modeling rather than relying solely on logistic regression. In
particular, a study of atrial fibrillation from this dataset used dozens of variables (Gammie et al., 2008).
Predictive modeling techniques are available to reduce the number of variables and to avoid the hazard
of over-fitting the model. However, these techniques are not commonly employed in medical research
studies. One such study started with 708,593 patients but did not define a holdout sample to validate
the results, nor did it compensate for the rare occurrence (Mehta et al., 2008). One of the problems with
these analyses is that the database remains proprietary to a select group of investigators, as is the
case with the SEARCH database, so there is no independent examination of the data or of the results
(Boffa et al., 2008).
One of the great advantages of using these large databases is that it is possible to examine long-term
consequences of treatment for chronic illnesses (Raaijmakers et al., 2008). It is also possible to
investigate treatment decisions in relation to patient demographics, and to investigate the possibility
of disparities in treatment choices by gender (Aron, Nguyen, Stein, & Gill, 2008; Cho, Hoogwerf, Huang,
Brennan, & Hazen, 2008), race or ethnicity, and socio-economic status. It is also possible to use the
entire database and to use variable reduction techniques that are part of the data mining process rather
than continuing to rely on inclusion/exclusion criteria (Hauptman, Swindle, Burroughs, & Schnitzler, 2008).
These criteria can overlook confounding factors that are always important to investigate in
observational studies. Dividing a study population on just one factor tends to neglect any examination of
confounders (Good, Holschuh, Albertson, & Eldridge, 2008). In the study by Good et al., the determining
factor was grain consumption, completely overlooking the fact that many factors are involved in
nutrition and diet, and that they are all highly dependent on each other (Good et al., 2008).
Occasionally, studies do use data mining techniques, particularly the technique of separating
a holdout sample from the data used to define the model (Moran, Bristow, Solomon, George, & Hart,
2008). One such study, conducted in Australia, examined mortality in the intensive care unit (ICU). In
addition, by using receiver operating characteristic (ROC) curves, the study examined the difference
between false positives and false negatives. However, the study still did not consider the fact that
mortality remains a fairly rare occurrence, and that the group sizes for mortality and non-mortality are
quite different, influencing both model choice and results.
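The distinction between false positives and false negatives can be illustrated with a short sketch. The counts below are hypothetical (not taken from the Moran et al. study): with mortality at 8%, a model can post 92% accuracy while detecting only a quarter of the deaths, which is exactly what sensitivity and specificity, the quantities behind an ROC curve, expose:

```python
# Hypothetical holdout counts: 1,000 ICU patients, 80 deaths (8% mortality).
tp, fn = 20, 60    # deaths predicted correctly / missed (false negatives)
tn, fp = 900, 20   # survivors predicted correctly / flagged in error

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true-positive rate: y-axis of an ROC curve
specificity = tn / (tn + fp)   # 1 - false-positive rate (ROC x-axis complement)

print(f"accuracy={accuracy:.2f}, "
      f"sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}")
```

Here the 92% accuracy is driven almost entirely by the large non-mortality group; the sensitivity of 0.25 shows that three of every four deaths are missed, which is the imbalance problem the chapter describes.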