Extracting a Fuzzy System by Using Genetic Algorithms for Imbalanced Datasets Classification: Application on Down’s Syndrome Detection - Mining Complex Data

except the age of the mother. This variable is not included because reducing the information it contributes to solving the problem would be a mistake. Thus, in the MoM case only 3 variables are taken into account. The output variable only indicates whether or not the fetus has Down's syndrome. From now on, the non-MoM variables will be called physical variables.

For these variables, the medical team asked us to consider only white women¹ carrying a single fetus, in order to simplify the problem, since changes in race or in the number of fetuses produce significant changes in the values of the hormonal markers.

Therefore, the input variables are the age of the mother, her weight, the gestational age of the fetus, the existence of diabetes, the degree of tobacco and alcohol consumption, the hormonal markers AFP and hCG, and their respective MoMs. Since the MoMs incorporate almost all of the input variables, the variables have been divided into two groups:

1. One formed by the age of the mother and both MoMs (MoM-AFP and MoM-hCG).
2. Another formed by the 8 variables that are not expressed in MoMs: the age of the mother, her weight, the gestational age of the fetus, the existence of diabetes, the degree of tobacco and alcohol consumption, and the hormonal markers AFP and hCG.
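As an illustration only, the two groups of variables can be written down as two column lists. The field names below are hypothetical (the text does not name the dataset columns), and `select_group` is a helper introduced just for this sketch.

```python
# Hypothetical column names; the source text does not name the dataset fields.
MOM_GROUP = ["mother_age", "mom_afp", "mom_hcg"]

# The 8 "physical" variables, i.e. those not expressed in MoMs.
PHYSICAL_GROUP = [
    "mother_age", "mother_weight", "gestational_age", "diabetes",
    "tobacco", "alcohol", "afp", "hcg",
]


def select_group(record, columns):
    """Return the values of the given columns from one case, in order."""
    return [record[name] for name in columns]
```

Note that the age of the mother appears in both groups, matching the description above.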

The data is divided into two groups: one with 3109 cases (3096 negatives and 13 positives) and another with 4995 cases (4980 negatives and 15 positives). The cases are ordered chronologically, so those in the second group are later than those in the first. The data is numerical, has 2 output classes (the fetus has or does not have Down's syndrome), and its imbalance ratio is approximately 1:300, i.e., it is a highly imbalanced dataset. Table 2.1 shows the characteristics of the two datasets and of the total. The last set, Down_total, is the sum of the two previous ones.

Table 2.1. Characteristics of the 2 datasets and the total

Name        #patterns  #neg  #pos  %neg    %pos
Down_3109   3109       3096  13    99.60%  0.40%
Down_4995   4995       4980  15    99.70%  0.30%
Down_total  8922       8892  30    99.66%  0.34%
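The class percentages and the imbalance ratio can be derived directly from the negative and positive counts. The following is a minimal sketch using the counts listed in Table 2.1; the names `TABLE_2_1` and `class_balance` are introduced here for illustration only.

```python
# Counts as listed in Table 2.1: name -> (negatives, positives).
TABLE_2_1 = {
    "Down_3109": (3096, 13),
    "Down_4995": (4980, 15),
    "Down_total": (8892, 30),
}


def class_balance(neg, pos):
    """Return the pattern count, class percentages and negatives per positive."""
    total = neg + pos
    return {
        "patterns": total,
        "pct_neg": 100.0 * neg / total,
        "pct_pos": 100.0 * pos / total,
        "neg_per_pos": neg / pos,
    }


if __name__ == "__main__":
    for name, (neg, pos) in TABLE_2_1.items():
        stats = class_balance(neg, pos)
        print(f"{name}: {stats['patterns']} patterns, "
              f"{stats['pct_pos']:.2f}% positive, "
              f"about 1:{stats['neg_per_pos']:.0f}")
```

For the Down_total row this gives roughly 296 negatives per positive, in line with the approximate 1:300 ratio quoted in the text.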

In the current age/LR method, the hormonal markers and their MoMs are truncated at upper and lower limits. This adjusts the distribution toward a Gaussian shape, which holds in the central part but not in the tails. To obtain the Gaussian shape, the approximate limits are set at 3 times the variance. In our case, we considered that only the upper limits need to be truncated, and the truncation is set at 5 times the MoM (e.g., for the MoM variables the upper limit is 5). The reason is that the upper values are widely spread, which gives the variables a large range while the information is concentrated in a small region (95% of the cases lie in 10% of the range). This may cause problems when extracting a learning model, and for that reason the data is truncated. Table 2.2 shows the new upper limits.
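Upper-limit truncation of this kind amounts to clipping each value at a fixed ceiling while leaving the lower end untouched. A minimal sketch, assuming the upper limit of 5 stated above for the MoM variables; the function name `truncate_upper` and the sample values are illustrative and not taken from the dataset.

```python
def truncate_upper(values, upper_limit=5.0):
    """Clip every value above upper_limit down to upper_limit.

    Values at or below the limit, including small ones, are left unchanged,
    since only the upper end is truncated.
    """
    return [min(value, upper_limit) for value in values]


if __name__ == "__main__":
    # Illustrative MoM values only; not real patient data.
    sample_moms = [0.8, 1.2, 5.0, 12.5]
    print(truncate_upper(sample_moms))
```

For the physical variables the same function could be called with a different `upper_limit` per variable.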

1 In Spain, the white race is the majority.
