indicating that the rule base built is heavily affected by the noise. The noise filter limits the inclusion of misleading rules, resulting in a smoother drop in performance, even slower than that of SVM.
The last case is also very interesting. Since C4.5 is more robust to noise than SVM and Ripper, its accuracy drop as noise increases is lower. However, the use of noise filters is still recommended, as they improve performance both at the initial 0% level and at the remaining noise levels. The greatest differences between not filtering and using any filter are found with uniform class noise (Fig. 5.4 f). As indicated when describing the SVM case, uniform class noise is more disruptive, but filtering makes the performance of C4.5 comparable to the pairwise noise case (Fig. 5.4 e).
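The two class-noise schemes compared above can be sketched as follows. This is a minimal illustration, assuming the usual definitions from the noise literature: uniform class noise flips each label to a different, randomly chosen class with probability p, while pairwise class noise relabels examples of the majority class as the second-majority class with probability p. Function names are illustrative, not from the chapter.

```python
import random

def uniform_class_noise(labels, classes, p, seed=0):
    # Flip each label to a different random class with probability p.
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

def pairwise_class_noise(labels, majority, second, p, seed=0):
    # Relabel majority-class examples as the second-majority class
    # with probability p; other classes are left untouched.
    rng = random.Random(seed)
    return [second if (y == majority and rng.random() < p) else y
            for y in labels]
```

Uniform noise can corrupt any class into any other, which is why it is the more disruptive of the two schemes, whereas pairwise noise only confuses the two largest classes.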
Although not depicted here, the size of the C4.5 trees, the size of the Ripper rule base and the number of support vectors of SVM are all lower when noise filters are used as the noise amount increases, resulting in shorter times when evaluating examples for classification. This is especially critical for SVM, whose evaluation time increases dramatically with the number of selected support vectors.
5.5.3 Noise Filtering Efficacy Prediction by Data Complexity
Measures
In the previous Sect. 5.5.2 we saw that applying noise filters is beneficial in most cases, especially when higher amounts of noise are present in the data. However, applying a filter is not "free" in terms of computing time and information loss. Indiscriminate application of noise filtering might be the conclusion drawn from the example study above, but it is worth studying the behavior of noise filters further to obtain hints about whether filtering is useful or not for a given data set.
In an ideal case, only examples that are completely wrong would be erased from the data set. In practice, however, both correct examples and examples containing valuable information may be removed, since the filters are themselves ML techniques with their inherent limitations. As a consequence, these techniques do not always improve performance. Their success depends on several circumstances, such as the kind and nature of the data errors, the quantity of noise removed, and the ability of the classifier to cope with the loss of useful information caused by the filtering. Therefore, the efficacy of noise filters, i.e., whether their use improves classifier performance, depends on the noise-robustness and the generalization capabilities of the classifier used, but it also strongly depends on the characteristics of the data.
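To make concrete how a filter can discard both mislabeled and genuinely informative examples, here is a minimal sketch of one classical similarity-based filter, Edited Nearest Neighbours (ENN): an example is removed when its label disagrees with the majority label of its k nearest neighbours. This is a generic illustration of the idea, not the specific filters evaluated in the chapter.

```python
import math
from collections import Counter

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbours: return the indices of examples whose
    label agrees with the majority label of their k nearest neighbours.
    Examples failing this test are treated as noisy and dropped."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances to every other example, paired with their labels.
        neigh = sorted((math.dist(xi, xj), yj)
                       for j, (xj, yj) in enumerate(zip(X, y)) if j != i)
        majority = Counter(label for _, label in neigh[:k]).most_common(1)[0][0]
        if majority == yi:
            keep.append(i)
    return keep
```

Note that a borderline but correctly labeled example sitting near the opposite class can also fail the neighbourhood test and be erased, which is exactly the loss of useful information discussed above.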
Describing the characteristics of the data is not an easy task, as specifying what "difficult" means is usually not straightforward, or it simply does not depend on a single factor. Data complexity measures are a recent proposal to represent characteristics of the data that are considered difficult in classification tasks, e.g. the overlapping