except the age of the mother. This variable is not included because it would be an error
to reduce the amount of information it contributes to solving the problem. Thus, in
the MoM case only 3 variables are taken into account. The output variable only
indicates whether or not the fetus has Down's syndrome. From now on, the non-MoM
variables are called physical variables.
On the advice of the medical team, and in order to simplify the problem, only
white women¹ carrying a single fetus are considered, since changes in race or in the
number of fetuses produce significant changes in the values of the hormonal markers.
Therefore, the input variables are the age of the mother, her weight, the gestational
age of the fetus, the existence of diabetes, the degree of tobacco and alcohol
consumption, the hormonal markers AFP and hCG, and their respective MoMs. As the
MoMs incorporate almost all the input variables, the variables have been divided into
two groups:
1. One formed by the age of the mother and both MoMs (MoM-AFP and MoM-hCG).
2. Another formed by the 8 variables that are not expressed in MoMs: the age of the
mother, her weight, the gestational age of the fetus, the existence of diabetes, the
degree of tobacco and alcohol consumption, and the AFP and hCG markers.
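The two groups above can be written down as simple column selections. The following is only a minimal sketch, assuming each case is stored as a dictionary; every key name (age, weight, mom_afp and so on) is a hypothetical label chosen for illustration, since the field names of the original dataset are not given.

```python
# Hypothetical column names; the source does not give the dataset's field names.
MOM_GROUP = ["age", "mom_afp", "mom_hcg"]
PHYSICAL_GROUP = ["age", "weight", "gestational_age", "diabetes",
                  "tobacco", "alcohol", "afp", "hcg"]

def select_features(case, group):
    """Return the values of one case restricted to the given variable group."""
    return [case[name] for name in group]
```

Note that the age of the mother appears in both groups, as stated in the list.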
The data is divided into two groups: one with 3109 cases (3096 negatives and 13
positives) and another with 4995 cases (4980 negatives and 15 positives). The groups
are ordered chronologically, so the cases of the second group are later than those of
the first. The data is numerical, has 2 output classes (the fetus has or does not have
Down's syndrome) and has an imbalance ratio of approximately 1:300, i.e., it is a
highly imbalanced dataset. Table 2.1 shows the characteristics of the two datasets
and of the total. The last set, Down_total, is the sum of the two previous ones.
Table 2.1. Characteristics of the 2 datasets and the total
Name        #patterns  #neg  #pos  %neg    %pos
Down_3109   3109       3096  13    99.60%  0.40%
Down_4995   4995       4980  15    99.70%  0.30%
Down_total  8922       8892  30    99.66%  0.34%
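The class percentages and the imbalance ratio follow directly from the negative and positive counts. As a minimal sketch (the function name class_balance is an assumption made for illustration), the figures for the Down_4995 set can be reproduced as follows:

```python
def class_balance(n_neg, n_pos):
    """Return (%neg, %pos, negatives per positive) for a two-class dataset."""
    total = n_neg + n_pos
    pct_neg = 100.0 * n_neg / total
    pct_pos = 100.0 * n_pos / total
    return pct_neg, pct_pos, n_neg / n_pos

# Counts of the Down_4995 set as given in the text.
pct_neg, pct_pos, ratio = class_balance(4980, 15)
print(f"%neg={pct_neg:.2f} %pos={pct_pos:.2f} ratio=1:{ratio:.0f}")
# prints: %neg=99.70 %pos=0.30 ratio=1:332
```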
In the current age/LR method, the hormonal markers and their MoMs are truncated at
upper and lower limits. This is due to the adjustment of the function to a Gaussian
shape, which holds in the central part of the distribution but not at the ends. To
preserve the Gaussian shape, the limits are set at approximately 3 times the variance.
In our case, we considered that only the upper values have to be truncated, and the
truncation limit is 5 times the MoM (e.g. for the MoM variables the upper limit is 5).
The reason is that the upper values are widely spread, which gives the variables a
large range while the information is concentrated in a small area (95% of the cases
are concentrated in 10% of the range). This may cause problems when extracting a
learning model, and for that reason the data is truncated. Table 2.2 shows the new
upper limits.
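The upper truncation can be illustrated with a short sketch. It assumes that "truncated" means capping values at the upper limit rather than discarding the cases, which the text does not state explicitly; the function name truncate_upper and the sample values are made up for illustration.

```python
def truncate_upper(values, upper_limit=5.0):
    """Cap every value above upper_limit at upper_limit; lower values are untouched."""
    return [min(v, upper_limit) for v in values]

# Made-up MoM values for illustration only (not taken from the dataset).
mom_values = [0.4, 1.0, 2.7, 5.0, 12.3]
truncated = truncate_upper(mom_values)
```

Only the last value exceeds the limit of 5, so it is the only one changed; no lower limit is applied, in line with the choice described above.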
¹ In Spain, the white population is the majority.