because none of the preprocessing steps are clearly identified. The programming code necessary for the
extraction is also not provided. Consider a recent study on heart failure. The total number of subjects
was 278,214. The number was reduced to 70,571 subjects for a logistic regression to test the relation-
ship between length of stay and treatment. With such a large sample size, the independent variables will
be statistically significant even when the effect size is almost zero. Other studies with large sample
sizes have similar problems, reporting all independent variables as statistically significant (Delaney,
Chang, Senagore, & Broder, 2008).
In a study of length of stay for the treatment of lung cancer, the sample size was 4,979, but the treatment
under consideration was performed in only 351 patients (7%), making it a rare occurrence (Wright et
al., 2008). The study did not adjust for this rarity, nor did it report the difference between the false
positive and the false negative rate. It is therefore doubtful that the study has any real predictive capa-
bility. Any model that predicts non-occurrence for every patient will be 93% accurate, so a good prediction
model would have to be more accurate than 93%. Another aspect of predictive modeling is that multiple
models are used and compared, defining a holdout sample (or minimizing misclassification costs) to find
the optimal choice. In traditional statistics, one model is chosen and used without any attempt to validate
the model choice or to compare it to other models (Odueyungbo, Browne, Akhtar-Danesh, & Thabane, 2008).
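The point about rare occurrences can be made concrete with a small sketch. Using hypothetical counts mirroring the lung-cancer study (4,979 subjects, 351 treated), a "model" that predicts non-occurrence for everyone scores roughly 93% accuracy on a holdout sample, which is why raw accuracy is a poor yardstick here:

```python
import random

random.seed(0)

# Hypothetical labels mirroring the lung-cancer study: 4,979 subjects,
# 351 of whom (about 7%) received the rare treatment.
n_total, n_positive = 4979, 351
labels = [1] * n_positive + [0] * (n_total - n_positive)
random.shuffle(labels)

# Holdout split: 70% would be used to fit a model, 30% to evaluate it.
split = int(0.7 * n_total)
train, holdout = labels[:split], labels[split:]

# Baseline "model" that predicts non-occurrence for every patient.
predictions = [0] * len(holdout)
accuracy = sum(p == y for p, y in zip(predictions, holdout)) / len(holdout)
print(f"All-negative baseline accuracy: {accuracy:.1%}")  # roughly 93%
```

Any candidate model must beat this trivial baseline on the holdout sample before its accuracy figure means anything.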
The Society of Thoracic Surgeons has an excellent database repository; however, the statistical
methods used need to include predictive modeling rather than relying solely on logistic regression. In
particular, a study of atrial fibrillation from this dataset used dozens of variables (Gammie et al., 2008).
Predictive modeling techniques are available to reduce the number of variables and to avoid the hazard
of over-fitting the model. However, these techniques are not commonly employed in medical research
studies. One such study started with 708,593 patients but did not define a holdout sample to validate
the results, nor did it compensate for the rare occurrence (Mehta et al., 2008). One of the problems with
these analyses is that the database remains proprietary to a select group of investigators, as is the
case with the SEARCH database, so there is no independent examination of the data or of the results
(Boffa et al., 2008).
One of the great advantages of using these large databases is that it is possible to examine long-term
consequences of treatment for chronic illnesses (Raaijmakers et al., 2008). It is also possible to
investigate treatment decisions in relation to patient demographics, and to investigate the possibility
of disparities in treatment choices by gender (Aron, Nguyen, Stein, & Gill, 2008; Cho, Hoogwerf, Huang,
Brennan, & Hazen, 2008), race or ethnicity, and socio-economic status. It is also possible to use the
entire database and to use variable reduction techniques that are part of the data mining process rather
than continuing to rely on inclusion/exclusion criteria (Hauptman, Swindle, Burroughs, & Schnitzler, 2008).
These criteria can overlook confounding factors that are always important to investigate in
observational studies. Dividing a study population on just one factor tends to neglect any examination of
confounders (Good, Holschuh, Albertson, & Eldridge, 2008). In the study by Good et al., the determining
factor was grain consumption, completely overlooking the fact that many factors are involved in
nutrition and diet, and that they are all highly dependent on each other (Good et al., 2008).
Occasionally, studies do use data mining techniques, particularly the technique of separating
a holdout sample from the data used to define the model (Moran, Bristow, Solomon, George, & Hart,
2008). One such study, conducted in Australia, examined mortality in the intensive care unit (ICU). In
addition, by using receiver operating characteristic (ROC) curves, the study examined the difference
between false positives and false negatives. However, the study still did not consider the fact that
mortality remains a fairly rare occurrence, and that the group sizes for mortality and non-mortality are
quite different, influencing both model choice and results.
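The distinction between false positives and false negatives can be illustrated with a short sketch. The counts below are hypothetical (not taken from the Moran et al. study): with mortality at 8%, a model can post 92% accuracy while detecting only a quarter of the deaths, which is exactly what sensitivity and specificity, the quantities behind an ROC curve, expose:

```python
# Hypothetical holdout counts: 1,000 ICU patients, 80 deaths (8% mortality).
tp, fn = 20, 60    # deaths predicted correctly / missed (false negatives)
tn, fp = 900, 20   # survivors predicted correctly / flagged in error

accuracy = (tp + tn) / (tp + tn + fp + fn)
sensitivity = tp / (tp + fn)   # true-positive rate: y-axis of an ROC curve
specificity = tn / (tn + fp)   # 1 - false-positive rate (ROC x-axis complement)

print(f"accuracy={accuracy:.2f}, "
      f"sensitivity={sensitivity:.2f}, "
      f"specificity={specificity:.2f}")
```

Here the 92% accuracy is driven almost entirely by the large non-mortality group; the sensitivity of 0.25 shows that three of every four deaths are missed, which is the imbalance problem the chapter describes.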