in the analysis, increasing the number of variables to be considered. Such studies can examine adverse
events, long-term follow-up, and treatment interactions that cannot be examined in clinical trials
(Tirkkonen et al., 2008).
Another factor that must be considered is that clinical databases tend to be very large. They are so large
that the standard measure of a model's effectiveness, the p-value, will become statistically significant
with an effect size that is nearly zero. Therefore, other measures are needed to assess a model's
effectiveness. In linear regression or the general linear model, it would not be unusual to have a model
that is statistically significant but with an r² value of 2% or less, suggesting that most of the variability
in the outcome variable remains unaccounted for (Loebstein, Katzir, Vasterman-Landes, Halkin, &
Lomnicky, 2008). When most of the p-values are reported as '<0.00001', it is a sure sign that there are
too many patient observations for the p-value to be informative.
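The effect of sample size on the p-value can be illustrated with a minimal sketch. It assumes a hypothetical correlation of r = 0.02 (so r² = 0.04%) and uses the Fisher z normal approximation for the test of zero correlation; the function name and the chosen sample sizes are illustrative and do not come from the studies cited above.

```python
import math

def correlation_p_value(r, n):
    """Approximate two-sided p-value for H0: correlation = 0.

    Uses the Fisher z transformation: atanh(r) * sqrt(n - 3) is
    approximately standard normal under H0 when n is large.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    # Two-sided tail probability of a standard normal variable
    return math.erfc(abs(z) / math.sqrt(2))

r = 0.02  # hypothetical effect: r^2 = 0.0004, far below the 2% mentioned in the text
for n in (1_000, 100_000, 1_000_000):
    p = correlation_p_value(r, n)
    print(f"n = {n:>9,}   r^2 = {r * r:.2%}   p = {p:.3g}")
```

The effect size is the same in every row; only the number of observations changes, and with it the p-value moves from clearly non-significant to far below 0.00001.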
Large datasets are also required to examine rare occurrences. There must be a sufficient number
of rare occurrences in the database to support meaningful comparisons. For example, if a condition occurs
0.1% of the time, there would be approximately 1 such occurrence for every 1000 patients and 10 occurrences
for 10,000 patients. It would require a minimum of 100,000 patients in the dataset to find 100 occurrences.
However, all 100,000 patients cannot be used in a model to predict the occurrences. Such a model would be
more than 99% accurate, but would predict nearly every patient as a non-occurrence. The problem of rare
occurrences will be discussed in more detail in Chapter 3. In the absence of large samples and long-term
follow-up, surrogate endpoints are still used (Sabate, 1999).
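The arithmetic behind the rare-occurrence problem can be sketched as follows. The example assumes the 0.1% occurrence rate used in the text and a hypothetical model that labels every patient a non-occurrence; the function names are illustrative only.

```python
def expected_occurrences(rate, n_patients):
    """Expected number of occurrences at a given rate."""
    return rate * n_patients

def always_negative_metrics(n_patients, n_occurrences):
    """Accuracy and sensitivity of a model that predicts 'non-occurrence' for everyone."""
    true_negatives = n_patients - n_occurrences   # every non-occurrence is classified correctly
    accuracy = true_negatives / n_patients
    sensitivity = 0 / n_occurrences               # no true occurrence is ever detected
    return accuracy, sensitivity

rate = 0.001  # 0.1% of patients, as in the text
for n in (1_000, 10_000, 100_000):
    print(f"{n:>7,} patients -> about {expected_occurrences(rate, n):.0f} occurrences")

accuracy, sensitivity = always_negative_metrics(100_000, 100)
print(f"accuracy = {accuracy:.1%}, sensitivity = {sensitivity:.0%}")
```

High accuracy here reflects only the rarity of the condition: the model identifies none of the 100 occurrences, which is why accuracy alone is a poor measure for rare outcomes.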
Another advantage of using data mining techniques on large datasets is that we can investigate
outcomes at the patient level rather than at the group level. Typically in regression, we look to patient type
to determine those at high risk. Patients above a certain age represent one type. Patients who smoke
represent another type. However, with data mining, we can examine a patient of a specific age who smokes
10 cigarettes a week, who drinks one glass of wine on weekends, and who is in good physical shape
to predict specific outcomes.
Another factor we can consider is that physicians vary in how they treat similar patients.
That variability itself can be used to examine the relationship between physician treatment decisions
and patient outcomes. Once we determine which outcome is “best” from the patient's viewpoint, we can
determine which treatment decisions are more likely to lead to that outcome. This is particularly true
for patients with chronic illness, where there is a sequence of treatment decisions followed by multiple
patient outcomes. For example, a patient with diabetes can start with medication, progressing to insulin
injections as the disease itself progresses. Moreover, patients with diabetes can end up with organ failure:
heart, kidney, and so on. We can examine treatments that prolong the time to such organ failure.
Another important problem in healthcare that can be examined using data mining is the
scheduling of personnel to match patient needs. To examine solutions, we can use time series methods. We
can also examine physician prescribing habits to determine the impact of new drugs or new procedures
and how they change patient care. Time series methods are currently underutilized in the data analysis
of clinical databases.
Because of the value of these large databases, many professional medical societies are developing
registries (Andaluz & Zuccarello, 2008; Hamilton et al., 2008; Wax, Srivastava, Shubikha, & Joashi,
2008). The SEARCH database discussed in Hamilton et al. (2008) is quite popular in the
medical literature for outcomes research (Turley et al., 2008). It is, however, proprietary to a research study
group. It is not clear in Hamilton et al. that the statistical methods used to investigate the databases take
the large size into consideration. In many cases, it is difficult to tell if the extracted data are meaningful