Car dataset, taking from the minor class only half of the patterns used in the
method described in Table 2.4. With a training/test ratio of 7:3, 24 positive
patterns were chosen for training rather than 48 (48 being 7/10 of 69). The results
after this reduction in the number of minor-class patterns during the training
phase were practically the same as with 48 (g=0.93). Similar
results were found for the Hepatitis dataset (taking only 30 and 5 patterns from the two
classes).
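The pattern counts quoted above follow from simple arithmetic, which the short sketch below reproduces. The function name and the choice of rounding down are assumptions made for illustration; they are not taken from the original experiments.

```python
def minor_class_training_counts(n_minor, train_parts=7, total_parts=10):
    """Return (full, halved) numbers of minor-class training patterns.

    Rounding down is an assumption; the source only quotes the results.
    """
    # Integer arithmetic avoids floating-point rounding surprises.
    full = n_minor * train_parts // total_parts
    halved = full // 2
    return full, halved


# 69 positive patterns with a 7:3 training/test split.
print(minor_class_training_counts(69))  # (48, 24)
```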
2.5 Future Trends
This work opens several lines of research. The first is to address the overfitting
problem that appears during training by evaluating the overlap between the classes.
For this purpose, we consider using the measure proposed by S. Visa and A. Ralescu
in [24], which quantifies the degree of overlap between datasets. Batista et al. [22]
concluded that the main problem in learning from imbalanced datasets is the degree
of overlap between classes (the case in which a boundary cannot be well defined).
On the basis of this measure, a line of research can therefore be opened to
anticipate whether overfitting is likely to occur and to propose solutions for it.
Within the same line of research, a new measure could also be proposed.
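As a rough illustration of what a class-overlap estimate might look like, the sketch below computes the fraction of patterns whose nearest neighbour belongs to a different class. This is a simple stand-in proxy chosen for brevity; it is not the measure of Visa and Ralescu, whose definition is not given in this text, and the function name is hypothetical.

```python
import math


def nearest_neighbour_overlap(points, labels):
    """Fraction of patterns whose nearest neighbour has a different label.

    A crude overlap proxy (an assumption for illustration): values near 0
    suggest well-separated classes, values near 1 suggest heavy overlap.
    """
    if len(points) < 2:
        raise ValueError("need at least two patterns")
    mismatches = 0
    for i, p in enumerate(points):
        best_dist, best_label = math.inf, None
        for j, q in enumerate(points):
            if i == j:
                continue  # a pattern is not its own neighbour
            d = math.dist(p, q)  # Euclidean distance
            if d < best_dist:
                best_dist, best_label = d, labels[j]
        if best_label != labels[i]:
            mismatches += 1
    return mismatches / len(points)
```

A value computed before training could then be compared with the overfitting observed afterwards, which is the kind of anticipation the paragraph above describes.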
Another open line, related to the previous one, is to determine the (approximate)
ideal number of rules for a dataset. At present it is determined from the number of
membership functions generated. Besides the number of rules, this line of research
would also focus on determining the best values of the parameters used in the
FLAGID method: order of the patterns, reshrink operation, discarding Fuzzy Points,
etc. At present these values are chosen through tests carried out with the training
set, but it could be productive to calculate some of these parameters
automatically from the dataset.
One of the problems with the SVM variant applied to imbalanced datasets is
knowing the ideal C-/C+ ratio. All the publications reported that choosing the
ratio between the number of patterns of each class as the C-/C+ ratio already
gave good results. A line of research is open in this field, since the
Down's syndrome dataset does not follow this empirical rule.
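The empirical rule mentioned above can be sketched as follows. The sketch assumes that the class with fewer patterns receives the larger penalty, so that C+/C- equals n-/n+; the text does not state the direction explicitly. The function name and the `base_c` parameter are hypothetical.

```python
from collections import Counter


def class_penalties(labels, base_c=1.0, positive=1):
    """Return (c_plus, c_minus) with c_plus / c_minus = n_minus / n_plus.

    Assumption: errors on the rarer class are penalised more heavily,
    in proportion to the class-size ratio.
    """
    counts = Counter(labels)
    n_plus = counts[positive]
    n_minus = len(labels) - n_plus
    if n_plus == 0 or n_minus == 0:
        raise ValueError("both classes must be present")
    c_minus = base_c
    c_plus = base_c * n_minus / n_plus
    return c_plus, c_minus


# 10 positive and 90 negative patterns.
print(class_penalties([1] * 10 + [0] * 90))  # (9.0, 1.0)
```

Penalties of this kind can be passed to an SVM implementation that accepts per-class costs, for example through the `class_weight` parameter of scikit-learn's `SVC`. Since the Down's syndrome dataset does not follow the rule, the resulting ratio is best treated as a starting point for a search rather than a final value.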
One proposal for improving the success rate in determining whether a fetus is
affected by Down's syndrome would be to combine both methods: the age/LR method
and the result of the FLAGID method.
Another line of research is to modify the FLAGID method so that it gives better
results on imbalanced datasets. One possible improvement is to combine the results
of different solutions using bagging, or to combine this method with others
using boosting.
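A minimal sketch of the bagging idea follows: several models are trained on bootstrap resamples of the training set and their predictions are combined by majority vote. The base learner here is a toy one-dimensional nearest-mean classifier used only as a stand-in; it is not FLAGID, and all names are hypothetical.

```python
import random
from collections import Counter


def bagging_predict(train_fn, patterns, labels, x, n_models=11, seed=0):
    """Majority vote of models trained on bootstrap resamples."""
    rng = random.Random(seed)
    n = len(patterns)
    votes = []
    for _ in range(n_models):
        # Sample n indices with replacement (one bootstrap resample).
        idx = [rng.randrange(n) for _ in range(n)]
        model = train_fn([patterns[i] for i in idx], [labels[i] for i in idx])
        votes.append(model(x))
    return Counter(votes).most_common(1)[0][0]


def train_nearest_mean(patterns, labels):
    """Toy 1-D base learner (stand-in, not FLAGID): nearest class mean."""
    sums, counts = {}, Counter(labels)
    for p, y in zip(patterns, labels):
        sums[y] = sums.get(y, 0.0) + p
    means = {y: sums[y] / counts[y] for y in sums}
    return lambda x: min(means, key=lambda y: abs(x - means[y]))
```

With a strongly imbalanced training set, a plain bootstrap resample may contain few or no minor-class patterns, so resampling each class separately may be a more suitable variant.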
2.6 Conclusions
We have presented a method, called FLAGID, for working with imbalanced or highly
imbalanced datasets. The method has been shown to give the same results as the used