indicating that the rule base built is heavily affected by the noise. The noise filter limits the inclusion of misleading rules, resulting in a smoother drop in performance, even slower than that of SVM.
The last case is also very interesting. Since C4.5 is more robust to noise than SVM and Ripper, its accuracy drop as noise increases is lower. However, the use of noise filters is still recommended, as they improve performance both at the initial 0% level and at the remaining noise levels. The greatest differences between not filtering and using any filter are found with uniform class noise (Fig. 5.4 f). As indicated when describing the SVM case, uniform class noise is more disruptive, but filtering makes the performance of C4.5 comparable to the pairwise noise case (Fig. 5.4 e).
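The two class-noise schemes compared above can be sketched as follows. This is a minimal illustration, assuming the usual definitions from the noise literature: uniform class noise flips each label to a different, randomly chosen class with probability p, while pairwise class noise relabels examples of the majority class as the second-majority class with probability p. Function names are illustrative, not from the chapter.

```python
import random

def uniform_class_noise(labels, classes, p, seed=0):
    # Flip each label to a different random class with probability p.
    rng = random.Random(seed)
    noisy = []
    for y in labels:
        if rng.random() < p:
            noisy.append(rng.choice([c for c in classes if c != y]))
        else:
            noisy.append(y)
    return noisy

def pairwise_class_noise(labels, majority, second, p, seed=0):
    # Relabel majority-class examples as the second-majority class
    # with probability p; other classes are left untouched.
    rng = random.Random(seed)
    return [second if (y == majority and rng.random() < p) else y
            for y in labels]
```

Uniform noise can corrupt any class into any other, which is why it is the more disruptive of the two schemes, whereas pairwise noise only confuses the two largest classes.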
Although not depicted here, the size of the C4.5 trees, the size of the Ripper rule base and the number of support vectors of SVM are all lower when noise filters are used as the noise amount increases, resulting in shorter times when evaluating examples for classification. This is especially critical for SVM, whose evaluation time increases dramatically with the number of selected support vectors.
5.5.3 Noise Filtering Efficacy Prediction by Data Complexity
Measures
In the previous Sect. 5.5.2 we saw that applying noise filters is beneficial in most cases, especially when higher amounts of noise are present in the data. However, applying a filter is not "free" in terms of computing time and information loss. Indiscriminate application of noise filtering might be the conclusion drawn from the example study above, but it is worth studying the behavior of noise filters further to obtain hints about whether filtering is useful or not for a given data set.
In an ideal case, only examples that are completely wrong would be erased from the data set. In practice, however, both correct examples and examples containing valuable information may be removed, since the filters are themselves ML techniques with their inherent limitations. As a consequence, these techniques do not always improve performance. Their success depends on several circumstances, such as the kind and nature of the data errors, the quantity of noise removed, and the ability of the classifier to cope with the loss of useful information caused by the filtering. Therefore, the efficacy of noise filters, i.e., whether their use improves classifier performance, depends on the noise-robustness and the generalization capabilities of the classifier used, but it also strongly depends on the characteristics of the data.
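To make concrete how a filter can discard both mislabeled and genuinely informative examples, here is a minimal sketch of one classical similarity-based filter, Edited Nearest Neighbours (ENN): an example is removed when its label disagrees with the majority label of its k nearest neighbours. This is a generic illustration of the idea, not the specific filters evaluated in the chapter.

```python
import math
from collections import Counter

def enn_filter(X, y, k=3):
    """Edited Nearest Neighbours: return the indices of examples whose
    label agrees with the majority label of their k nearest neighbours.
    Examples failing this test are treated as noisy and dropped."""
    keep = []
    for i, (xi, yi) in enumerate(zip(X, y)):
        # Distances to every other example, paired with their labels.
        neigh = sorted((math.dist(xi, xj), yj)
                       for j, (xj, yj) in enumerate(zip(X, y)) if j != i)
        majority = Counter(label for _, label in neigh[:k]).most_common(1)[0][0]
        if majority == yi:
            keep.append(i)
    return keep
```

Note that a borderline but correctly labeled example sitting near the opposite class can also fail the neighbourhood test and be erased, which is exactly the loss of useful information discussed above.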
Describing the characteristics of the data is not an easy task, as specifying what "difficult" means is usually not straightforward, or it simply does not depend on a single factor. Data complexity measures are a recent proposal to represent characteristics of the data that are considered difficult in classification tasks, e.g. the overlapping