Introduction to Ranking Models - Text Mining Techniques for Healthcare Provider Quality Determination


in the analysis, increasing the number of variables to be considered. Such studies can examine adverse

events, long-term follow-up, and treatment interactions that cannot be studied in clinical trials

(Tirkkonen et al., 2008).

Another factor that must be considered is that clinical databases tend to be very large. They are so large

that the standard measure of a model's effectiveness, the p-value, will indicate statistical significance

even when the effect size is nearly zero. Therefore, other measures are needed to assess a model's

effectiveness. In linear regression or the general linear model, it would not be unusual to have a model

that is statistically significant but with an R² value of 2% or less, suggesting that most of the variability

in the outcome variable remains unaccounted for (Loebstein, Katzir, Vasterman-Landes, Halkin, &

Lomnicky, 2008). It is a sure sign that there are too many patient observations for p-values to be

informative when most of the p-values are reported as '<0.00001'.
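
The point about p-values in very large databases can be illustrated with a small simulation. The sketch below is not taken from the text: the sample size of 200,000, the slope of 0.02, and the function name simple_linear_fit are all assumptions chosen for illustration, the data are simulated rather than clinical, and the p-value uses a normal approximation that is only reasonable because the sample is so large.

```python
import math
import random


def simple_linear_fit(x, y):
    """Least-squares fit of y = a + b*x; returns (slope, r_squared, p_value)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    r_squared = (sxy * sxy) / (sxx * syy)       # share of variability explained

    residual_ss = syy - slope * sxy             # unexplained sum of squares
    std_err = math.sqrt(residual_ss / (n - 2) / sxx)
    z = slope / std_err
    # Two-sided p-value from the normal approximation to the t distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return slope, r_squared, p_value


random.seed(0)            # arbitrary seed, only so the run is repeatable
n_patients = 200_000      # hypothetical database size
true_slope = 0.02         # hypothetical, deliberately tiny effect

x = [random.gauss(0.0, 1.0) for _ in range(n_patients)]
y = [true_slope * xi + random.gauss(0.0, 1.0) for xi in x]

slope, r_squared, p_value = simple_linear_fit(x, y)
print(f"slope = {slope:.4f}, R^2 = {r_squared:.5f}, p-value = {p_value:.2e}")
```

Under these assumptions the p-value falls far below 0.00001 while R² stays well under 2%, which is the pattern described above: statistical significance alongside almost no explained variability.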

Large datasets are also required to examine rare occurrences. The database must contain enough

rare occurrences for meaningful comparisons. For example, if a condition occurs 0.1% of the

time, there would be approximately 1 such occurrence for every 1000 patients and 10 occurrences for

10,000 patients. It would require a minimum of 100,000 patients in the dataset to find 100 occurrences.

However, the model cannot simply be fit to all 100,000 patients to predict the occurrences. Such a model

would be about 99.9% accurate, but would predict nearly every patient as a non-occurrence. The problem

of rare occurrences will be discussed in more detail in Chapter 3. In the absence of large samples and

long-term follow-up, surrogate endpoints are still used (Sabate, 1999).
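
The arithmetic for rare occurrences can be checked with a few lines of code. This is a minimal sketch under the figures given above (a 0.1% rate and a target of 100 occurrences); the function names are hypothetical, and the "model" is simply the degenerate rule that labels every patient a non-occurrence, used here only to show why accuracy alone is misleading.

```python
import math
from fractions import Fraction


def expected_occurrences(rate, n_patients):
    """Expected number of occurrences among n_patients at the given rate."""
    return int(rate * n_patients)


def patients_needed(rate, target_occurrences):
    """Smallest dataset size expected to contain the target number of occurrences."""
    return math.ceil(target_occurrences / rate)


def always_negative_metrics(n_patients, n_occurrences):
    """Accuracy and sensitivity of a rule that predicts non-occurrence for everyone."""
    true_negatives = n_patients - n_occurrences
    accuracy = true_negatives / n_patients
    sensitivity = 0 / n_occurrences        # no occurrence is ever detected
    return accuracy, sensitivity


rate = Fraction(1, 1000)                   # 0.1%, kept exact to avoid rounding

for n in (1_000, 10_000, 100_000):
    print(f"{n:>7} patients: about {expected_occurrences(rate, n)} occurrences")

needed = patients_needed(rate, 100)
accuracy, sensitivity = always_negative_metrics(100_000, 100)
print(f"patients needed for 100 occurrences: {needed}")
print(f"accuracy = {accuracy:.1%}, sensitivity = {sensitivity:.1%}")
```

The rule scores 99.9% accuracy while detecting none of the 100 occurrences, so a measure such as sensitivity has to be reported alongside accuracy when the outcome is rare.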

Another advantage of using data mining techniques on large datasets is that we can investigate outcomes

at the patient level rather than at the group level. Typically in regression, we look to patient type

to determine those at high risk. Patients above a certain age represent one type. Patients who smoke

represent a type. However, with data mining, we can examine a patient of a specific age who smokes

10 cigarettes a week, who drinks one glass of wine on weekends, and who is physically in good shape

to predict specific outcomes.

Another factor we can consider is that physicians vary in how they treat similar patients.

That variability itself can be used to examine the relationship between physician treatment decisions

and patient outcomes. Once we determine which outcome is “best” from the patient's viewpoint, we can

determine which treatment decisions are more likely to lead to that outcome. This is particularly true

for patients with chronic illness where there is a sequence of treatment decisions followed by multiple

patient outcomes. For example, a patient with diabetes can start with medication, progressing to insulin

injections as the disease itself progresses. Moreover, patients with diabetes can end up with organ failure:

heart, kidney, and so on. We can examine treatments that prolong the time to such organ failure.

Another important problem in healthcare that can be examined using data mining has to do with

scheduling of personnel given patient needs. To examine solutions, we can use time series methods. We

can also examine physician prescribing habits to determine the impact of new drugs, or new procedures

and how they change patient care. Time series methods currently are under-utilized in the data analysis

of clinical databases.

Because of the value of these large databases, many professional medical societies are developing

registries (Andaluz & Zuccarello, 2008; Hamilton et al., 2008; Wax, Srivastava, Shubikha, & Joashi,

2008). The SEARCH database discussed in Hamilton et al. (2008) is quite popular in the

medical literature for outcomes research (Turley et al., 2008). It is, however, proprietary to a research study

group. It is not clear in Hamilton et al. that the statistical methods used to investigate the databases take

the large size into consideration. In many cases, it is difficult to tell if the extracted data are meaningful
