Explaining Medical Model Exceptions - Computational Intelligence in Healthcare: Advanced Methodologies

4 Missing Data

Databases with many variables pose specific problems. Since it is very difficult to get an overview of their content, a user usually does not know a priori how complete a data set is. Are any data missing? How many, and where are they located?
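
The two questions above (how many values are missing, and where) can be answered with a simple completeness profile. The following is a minimal sketch, not part of the original study: the function name `missingness_profile`, the variable names, and the toy values are all hypothetical, and a missing entry is assumed to be stored as `None`.

```python
from collections import Counter


def missingness_profile(records, variables):
    """Count missing (None) entries per variable and record where they occur."""
    counts = Counter()
    locations = []  # (row index, variable name) of every missing cell
    for row_index, record in enumerate(records):
        for name in variables:
            if record.get(name) is None:
                counts[name] += 1
                locations.append((row_index, name))
    total_cells = len(records) * len(variables)
    completeness = 1.0 - len(locations) / total_cells if total_cells else 1.0
    return {
        "counts": dict(counts),
        "locations": locations,
        "completeness": completeness,
    }


# Hypothetical toy records; the values are invented for illustration only.
records = [
    {"haemoglobin": 11.2, "oxygen_pulse": None, "max_power": 95.0},
    {"haemoglobin": None, "oxygen_pulse": 9.8, "max_power": None},
    {"haemoglobin": 12.1, "oxygen_pulse": 10.4, "max_power": 110.0},
]
profile = missingness_profile(records, ["haemoglobin", "oxygen_pulse", "max_power"])
```

Such a profile also shows which parameters have only few missing values and are therefore candidates for main factors of the model.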

In the dialysis data set, many data are missing randomly and without any known regularity. The main cause is that many measurements were never carried out.

It can be assumed that the data set contains groups of interdependent variables, but a priori it is not known how many such groups there are, which variables are dependent, and in which way they are dependent. However, we intend to make use of all possible forms of dependency to impute missing data, because the more complete the observed database is, the easier it should be to find explanations for exceptional cases and, furthermore, the better the explanations should be. Even when setting up the model, the expert user should select as main factors those parameters with only few missing data. So, the more data are imputed, the better the choice for setting up the model can be.

A data analysis method is often assessed according to its tolerance to missing data

(e.g. in [25]). In principle, there are two main approaches to the missing data problem.

The first approach is a statistical imputation of missing data. Usually it is based on

non-missing data from other records.

The second approach suggests methods that accept the absence of some data. These methods range from simply excluding cases with missing values to rather sophisticated statistical models [26, 27].

Gediga and Düntsch [28] propose the use of CBR to impute missing data. Since their approach does not require any external information, they call it a “non-invasive imputation method”. Missing data are replaced by the corresponding values of the most similar retrieved cases. However, the dialysis data set contains rather few patients, which means that the “most similar” case for a query case might in fact not be very similar.
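
The general idea of filling a gap from the most similar case, and the risk that this case is not similar enough, can be sketched as follows. This is a generic nearest-case illustration under stated assumptions, not the method of Gediga and Düntsch [28]: the distance measure (mean range-normalised absolute difference), the threshold `max_distance`, and all names and values are hypothetical.

```python
import math


def distance(a, b, variables, ranges):
    """Mean range-normalised absolute difference over variables observed in both cases."""
    shared = [v for v in variables if a.get(v) is not None and b.get(v) is not None]
    if not shared:
        return math.inf  # nothing to compare: treat as maximally dissimilar
    return sum(abs(a[v] - b[v]) / ranges[v] for v in shared) / len(shared)


def impute_from_nearest(query, cases, variables, ranges, max_distance=0.2):
    """Fill each missing value of `query` from the closest case that has it.

    A value stays missing when no donor case is within `max_distance`,
    which guards against a "most similar" case that is not really similar.
    """
    completed = dict(query)
    for v in variables:
        if query.get(v) is not None:
            continue
        donors = [c for c in cases if c is not query and c.get(v) is not None]
        if not donors:
            continue
        best = min(donors, key=lambda c: distance(query, c, variables, ranges))
        if distance(query, best, variables, ranges) <= max_distance:
            completed[v] = best[v]
    return completed


# Hypothetical toy data; values and value ranges are invented for illustration.
variables = ["haemoglobin", "oxygen_pulse", "max_power"]
ranges = {"haemoglobin": 5.0, "oxygen_pulse": 10.0, "max_power": 100.0}
cases = [
    {"haemoglobin": 11.0, "oxygen_pulse": 9.5, "max_power": 100.0},
    {"haemoglobin": 14.0, "oxygen_pulse": 14.0, "max_power": 160.0},
]
near_query = {"haemoglobin": 11.2, "oxygen_pulse": None, "max_power": 95.0}
far_query = {"haemoglobin": 8.0, "oxygen_pulse": None, "max_power": 40.0}

near_result = impute_from_nearest(near_query, cases, variables, ranges)
far_result = impute_from_nearest(far_query, cases, variables, ranges)
```

With a small case base, the threshold decides how often a gap is left open rather than filled from a distant case; choosing it is a judgment call that this sketch does not settle.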

So, why not simply apply statistical methods? Statistical methods require homogeneity of the sample. However, there is no reason to expect the set of dialysis patients to be a homogeneous sample. Since the data consist of many parameters, missing values can sometimes be calculated or estimated from other parameter values. Furthermore, the number of cases in the data set is rather small, whereas statistical methods usually become more appropriate as the number of cases grows.
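
As an illustration of calculating a missing value from other parameters, consider oxygen pulse, which is defined as oxygen consumption per heartbeat and therefore equals oxygen uptake divided by heart rate. The sketch below assumes that oxygen uptake (in ml/min) and heart rate (in beats/min) are recorded for the same measurement; whether the dialysis data set stores them in this form is an assumption, and the field names are hypothetical.

```python
def oxygen_pulse(vo2_ml_per_min, heart_rate_bpm):
    """Oxygen pulse in ml per beat, or None if an input is missing or invalid."""
    if vo2_ml_per_min is None or heart_rate_bpm is None or heart_rate_bpm <= 0:
        return None
    return vo2_ml_per_min / heart_rate_bpm


def fill_derived(record):
    """Return a copy of `record` with a missing oxygen pulse calculated if possible."""
    completed = dict(record)
    if completed.get("oxygen_pulse") is None:
        # Hypothetical field names; the original data set may name these differently.
        completed["oxygen_pulse"] = oxygen_pulse(
            completed.get("vo2_ml_per_min"), completed.get("heart_rate_bpm")
        )
    return completed


# Invented example values: 1500 ml/min at 150 beats/min.
example = fill_derived(
    {"vo2_ml_per_min": 1500.0, "heart_rate_bpm": 150.0, "oxygen_pulse": None}
)
```

Unlike a statistical estimate, a value derived this way does not depend on other patients, so it is unaffected by the small size and heterogeneity of the sample.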

4.1 The Data Set

For each patient a set of physiological parameters is measured. These parameters contain information about burned calories, maximal power, oxygen pulse (volume of oxygen consumption per heartbeat), lung ventilation, and many others. Furthermore, there are biochemical parameters like haemoglobin and other laboratory measurements. All these parameters are supposed to be measured four times during the first year of participation in the fitness program: an initial measurement, followed by further ones after three months, after six months, and finally after a year. Since some parameters, e.g. the height of a patient, are supposed to remain constant within a year, they were measured just once. The other ones are regarded as factors with four
