Table 5.1 Filtering approaches by category as of [19]

Detection based on thresholding of a measure: classification confidence [82]; least complex correct hypothesis [24]; model influence (LOOPC) [57]; single perceptron perturbation [33]

Partition filtering for large data sets: for large and distributed data sets [102, 103]

Classifier predictions based: cost sensitive learning based [101]; SVM based [86]; ANN based [42]; multi classifier system [65]; C4.5 [44]

Nearest neighbor based: CNN [30]; BBNR [15]; IB3 [3]; nearest instances to a candidate [78, 79]; Tomek links [88]; PRISM [81]; DROP [93]

Voting filtering: ensembles [10, 11]; bagging [89]; ORBoost [45]

Graph connections based: Gabriel graphs [18]; edge analysis [92]; neighborhood graphs [66]
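To make the first category of Table 5.1 concrete, the following sketch illustrates detection based on thresholding of a measure, using the classification confidence as the measure. It is only an illustration, not the method of [82]: Python with scikit-learn and NumPy is assumed, and the choice of classifier and of threshold value is arbitrary.

# Illustrative sketch only: thresholding a per-instance confidence measure.
# Assumes scikit-learn and NumPy; classifier and threshold are arbitrary choices.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict

def confidence_threshold_filter(X, y, threshold=0.2, cv=5):
    """Return indices of instances whose confidence in their own label >= threshold."""
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    # Out-of-fold probabilities: each instance is scored by a model
    # that never saw that instance during training.
    proba = cross_val_predict(clf, X, y, cv=cv, method="predict_proba")
    classes = np.unique(y)  # column order of `proba` (sorted labels)
    own_label_conf = proba[np.arange(len(y)), np.searchsorted(classes, y)]
    return np.where(own_label_conf >= threshold)[0]

Instances whose out-of-fold probability for their own label falls below the threshold are flagged as potentially mislabeled and can be removed (or inspected) before training the final model.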
sensitive to class noise as well [74]. This instability has made them very suitable for ensemble methods. Some strategies can be used to counter this lack of stability. The first is to carefully select an appropriate splitting criterion. In [2] several measures are compared with the aim of minimizing the impact of label noise on the constructed trees, empirically showing that the imprecise info-gain measure is able to improve accuracy and to reduce the tree growth produced by the noise.
Another approach typically described as useful for dealing with noise in decision trees is pruning. Pruning tries to limit the overfitting caused by overspecialization on isolated (and usually noisy) examples. The work of [1] shows that the use of pruning helps to reduce the effect and impact of noise on the modeled trees. C4.5 is the best-known decision tree algorithm; it includes this pruning strategy by default and can easily be adapted to split under the desired criterion.
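As a small illustration of this pruning effect (a sketch under assumed tooling, not the experimental setup of [1] or [2]), the listing below uses scikit-learn's CART implementation as a stand-in for C4.5, with cost-complexity pruning standing in for C4.5's default pruning, on training data whose labels have been partially flipped.

# Illustrative sketch; assumes scikit-learn. CART + cost-complexity pruning is a
# stand-in for C4.5's pruning, and the noise level and alpha are arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.RandomState(0)
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Inject 20% label noise into the training partition only.
flip = rng.rand(len(y_tr)) < 0.20
y_noisy = np.where(flip, 1 - y_tr, y_tr)

unpruned = DecisionTreeClassifier(random_state=0).fit(X_tr, y_noisy)
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_tr, y_noisy)

# The pruned tree is usually much smaller and generalizes better under noise.
print("unpruned: leaves =", unpruned.get_n_leaves(), "test acc =", unpruned.score(X_te, y_te))
print("pruned:   leaves =", pruned.get_n_leaves(), "test acc =", pruned.score(X_te, y_te))

On such noisy data the pruned tree is typically far smaller than the unpruned one and generalizes better, which is exactly the behaviour the pruning strategy is meant to provide.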
We have seen that the use of ensembles is a good strategy to create accurate and robust filters.
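The listing below shows one way such an ensemble filter can be assembled, in the spirit of the voting filters of Table 5.1 (e.g. [10, 11]) rather than as a reproduction of any particular published method; scikit-learn and the specific base learners are assumptions made for illustration.

# Hedged sketch of a voting (ensemble) filter; assumes scikit-learn and NumPy.
import numpy as np
from sklearn.model_selection import cross_val_predict
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB

def ensemble_vote_filter(X, y, consensus=False, cv=5):
    """Return indices of instances kept after voting filtering.

    consensus=False -> majority filter: removed if most learners disagree.
    consensus=True  -> consensus filter: removed only if all learners disagree.
    """
    learners = [DecisionTreeClassifier(random_state=0),
                KNeighborsClassifier(n_neighbors=3),
                GaussianNB()]
    # Out-of-fold predictions of every learner for each training instance.
    votes = np.array([cross_val_predict(clf, X, y, cv=cv) for clf in learners])
    disagreements = (votes != y).sum(axis=0)
    limit = len(learners) if consensus else (len(learners) // 2 + 1)
    return np.where(disagreements < limit)[0]

With consensus=False an instance is discarded when a majority of the out-of-fold predictions contradict its label (majority filtering); with consensus=True it is discarded only when every learner disagrees (consensus filtering), a more conservative choice that removes fewer instances.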
One may also ask whether an ensemble of classifiers is itself robust against noise. Many ensemble approaches exist and their noise robustness has been tested. An ensemble is a system in which the base learners are all of the same type but are built to be as varied as possible. The two most classic approaches, bagging and boosting, were compared in [16], showing that bagging obtains better performance than boosting when label noise is present. The reason given in [1] is that boosting (or the particular implementation made by AdaBoost) increases the weights of noisy instances too much, making the model construction inefficient and imprecise, whereas mislabeled instances favour the variability of the base classifiers in bagging [19]. As AdaBoost is not the only boosting algorithm, other implementations such as LogitBoost and BrownBoost have been found to be more robust to class noise [64]. When the base