4 Missing Data
Databases with many variables pose specific problems. Because it is very difficult to
get an overview of their content, a user usually does not know a priori how complete a
data set is. Are any data missing? If so, how many values, and where are they located?
In the dialysis data set, many values are missing randomly and without any known
regularity. The main cause is that many measurements were never taken.
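The questions above (are data missing, how many, and where) can be answered mechanically before any modelling. The following is a minimal sketch, assuming each patient record is a dictionary in which `None` marks a missing value; the variable names and the numbers are invented for illustration and are not taken from the dialysis data set.

```python
from collections import Counter

# Hypothetical records: None marks a measurement that was never taken.
records = [
    {"patient": 1, "haemoglobin": 11.2, "oxygen_pulse": None, "max_power": 90.0},
    {"patient": 2, "haemoglobin": None, "oxygen_pulse": 9.5, "max_power": None},
    {"patient": 3, "haemoglobin": 12.1, "oxygen_pulse": None, "max_power": 110.0},
]


def missing_overview(rows, id_key="patient"):
    """Count missing values per variable and list where they occur."""
    counts = Counter()
    locations = []
    for row in rows:
        for name, value in row.items():
            if name != id_key and value is None:
                counts[name] += 1
                locations.append((row[id_key], name))
    return counts, locations


counts, locations = missing_overview(records)

# Share of missing cells among all measured variables (id column excluded).
total_cells = len(records) * (len(records[0]) - 1)
missing_share = sum(counts.values()) / total_cells
```

Sorting `counts` by value gives a quick ranking of the variables with the fewest gaps, which is the kind of overview an expert user would need when choosing main factors.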
It can be assumed that the data set contains groups of interdependent variables, but
it is not known a priori how many such groups there are, which variables are
dependent, and in which way they depend on each other. However, we intend to make use of
all possible forms of dependency to impute missing data, because the more complete
the observed database is, the easier it should be to find explanations for exceptional
cases, and the better those explanations should be. Even when setting up the
model, the expert user should select as main factors those parameters that have only a
few missing data. So, the more data are imputed, the better the choice for setting up the
model can be.
A data analysis method is often assessed according to its tolerance to missing data
(e.g. in [25]). In principle, there are two main approaches to the missing data problem.
The first approach is a statistical imputation of missing data. Usually it is based on
non-missing data from other records.
The second approach comprises methods that accept the absence of some data. These
methods range from simply excluding cases with missing values to rather
sophisticated statistical models [26, 27].
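To make the contrast concrete, the sketch below shows the simplest representative of each approach: replacing a missing value with the mean of the non-missing values from other records, and excluding every case that has a missing value. Mean imputation is used here only as an easily understood stand-in for statistical imputation; it is not the method of the cited works, and the record layout (dictionaries with `None` for missing values) is an assumption.

```python
from statistics import mean


def impute_mean(rows, name):
    """First approach (simplest form): fill gaps in one variable with
    the mean of the values observed in the other records."""
    observed = [row[name] for row in rows if row[name] is not None]
    if not observed:
        # Nothing observed, so nothing can be imputed.
        return [dict(row) for row in rows]
    fill = mean(observed)
    return [
        {**row, name: fill if row[name] is None else row[name]}
        for row in rows
    ]


def drop_incomplete(rows):
    """Second approach (simplest form): keep only complete cases."""
    return [row for row in rows if all(v is not None for v in row.values())]


# Hypothetical values for illustration only.
rows = [
    {"haemoglobin": 11.0, "max_power": None},
    {"haemoglobin": 13.0, "max_power": 100.0},
    {"haemoglobin": None, "max_power": 120.0},
]
imputed = impute_mean(rows, "max_power")
complete = drop_incomplete(rows)
```

In this toy example, case exclusion keeps one record out of three, which illustrates why it is costly when the number of cases is already small.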
Gediga and Düntsch [28] propose the use of CBR to impute missing data. Since
their approach does not require any external information, they call it a “non-invasive
imputation method”. Missing data are replaced by the corresponding
values of the most similar retrieved cases. However, the dialysis data set contains
rather few patients, which means that the “most similar” case for a query case might
in fact not be very similar.
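The idea of filling a gap from the most similar case can be sketched as follows. This is an illustration of nearest-case imputation in general, not the algorithm of [28]: the distance measure (mean range-scaled absolute difference over the variables observed in both cases), the function names, and the data are all assumptions. The function returns the distance together with the donor value, so that a caller can see when the "most similar" case is not actually close.

```python
def distance(query, case, ranges):
    """Mean range-scaled absolute difference over variables that are
    observed in both cases; None if they share no observed variable.
    Assumes every range in `ranges` is positive."""
    shared = [
        k for k in ranges
        if query.get(k) is not None and case.get(k) is not None
    ]
    if not shared:
        return None
    return sum(abs(query[k] - case[k]) / ranges[k] for k in shared) / len(shared)


def impute_from_most_similar(query, case_base, target, ranges):
    """Return (distance, donor_value) from the most similar case that
    has the target variable observed, or None if there is no such case."""
    other = {k: r for k, r in ranges.items() if k != target}
    best = None
    for case in case_base:
        if case.get(target) is None:
            continue  # this case cannot donate a value
        d = distance(query, case, other)
        if d is not None and (best is None or d < best[0]):
            best = (d, case[target])
    return best


# Hypothetical value ranges and cases, for illustration only.
ranges = {"haemoglobin": 10.0, "max_power": 100.0, "oxygen_pulse": 10.0}
case_base = [
    {"haemoglobin": 12.0, "max_power": 90.0, "oxygen_pulse": 9.0},
    {"haemoglobin": 8.0, "max_power": 150.0, "oxygen_pulse": 14.0},
    {"haemoglobin": 11.0, "max_power": 100.0, "oxygen_pulse": None},
]
query = {"haemoglobin": 11.0, "max_power": 100.0, "oxygen_pulse": None}
result = impute_from_most_similar(query, case_base, "oxygen_pulse", ranges)
```

With few patients in the case base, the returned distance may be large even for the best match; a threshold on that distance (chosen by the expert user) would be one way to refuse an imputation that is not trustworthy.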
So, why not simply apply statistical methods? Statistical methods require
homogeneity of the sample. However, there is no reason to expect the set of dialysis
patients to be a homogeneous sample. Since the data consist of many parameters,
missing values can sometimes be calculated or estimated from other parameter values.
Furthermore, the number of cases in the data set is rather small, whereas statistical
methods usually become more appropriate as the number of cases grows.
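As an illustration of calculating a missing value from other parameter values, consider oxygen pulse, which is by definition the volume of oxygen consumed per heartbeat. If oxygen uptake and heart rate were both recorded for the same measurement, a missing oxygen pulse could be derived as their quotient. The sketch below assumes exactly that; the field names and units are hypothetical, and whether these two parameters are actually available in the dialysis data set is not stated here.

```python
def derive_oxygen_pulse(record):
    """Return the recorded oxygen pulse, or derive it as oxygen uptake
    divided by heart rate when both are available; otherwise None.
    Field names are hypothetical."""
    if record.get("oxygen_pulse") is not None:
        return record["oxygen_pulse"]
    uptake = record.get("oxygen_uptake_ml_per_min")
    heart_rate = record.get("heart_rate_per_min")
    if uptake is None or heart_rate is None or heart_rate <= 0:
        return None  # the value cannot be derived from this record
    return uptake / heart_rate  # millilitres of oxygen per heartbeat


# Invented numbers for illustration only.
derived = derive_oxygen_pulse(
    {
        "oxygen_pulse": None,
        "oxygen_uptake_ml_per_min": 1500.0,
        "heart_rate_per_min": 150.0,
    }
)
```

Unlike a value borrowed from another patient, a value derived this way depends only on the patient's own measurements, so it does not rely on the sample being homogeneous.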
4.1 The Data Set
For each patient a set of physiological parameters is measured. These parameters
contain information about burned calories, maximal power, oxygen pulse (volume of
oxygen consumption per heartbeat), lung ventilation, and many others. Furthermore,
there are biochemical parameters such as haemoglobin and other laboratory
measurements. All these parameters are supposed to be measured four times during the
first year of participation in the fitness program: an initial measurement, followed
by further ones after three months, after six months, and finally after a year. Since
some parameters, e.g. the height of a patient, are supposed to remain constant within a
year, they were measured just once. The others are regarded as factors with four