hypothesis could be the same as one of the main motivations for combining classi-
fiers: the improvement of the generalization capability (due to the complementarity
of each classifier), which is a key question in noisy environments, since it might
allow one to avoid overfitting to the new characteristics introduced by the noisy
examples [84]. Most of the works studying MCSs and noisy data focus on
techniques such as bagging and boosting [16, 47, 56], which introduce diversity by
considering different samples of the set of training examples and use only one baseline
classifier. For example, in [16] the suitability of randomization, bagging and boosting
to improve the performance of C4.5 was studied. The authors reached the conclu-
sion that with a low noise level, boosting is usually more accurate than bagging and
randomization. However, bagging outperforms the other methods when the noise
level increases. Similar conclusions were obtained by Maclin and Opitz
[56]. Other works [47] compare the performance of boosting and bagging techniques
when dealing with imbalanced and noisy data, also concluding that bagging
methods generally outperform boosting ones. Nevertheless, explicit studies on the
adequacy for noisy data of MCSs other than bagging and boosting, that is, those
introducing diversity by using different base classifiers, have not yet been
carried out. Furthermore, most of the existing works focus on a concrete
type of noise and on a concrete combination rule. On the other hand, when data
suffer from noise, a proper study of how the robustness of each single method
influences the robustness of the MCS is necessary, but this point is usually overlooked
in the literature.
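Bagging, as described above, trains copies of a single baseline classifier on different bootstrap samples of the training set and combines them by majority vote. The following minimal sketch illustrates the idea in plain Python; the one-nearest-neighbour base learner and the toy data set are hypothetical stand-ins for illustration, not taken from the studies cited above.

```python
import random
from collections import Counter

# Hypothetical toy base learner: a one-nearest-neighbour classifier.
class OneNN:
    def fit(self, X, y):
        self.X, self.y = X, y
        return self

    def predict(self, x):
        # Label of the closest training point (squared Euclidean distance).
        best = min(range(len(self.X)),
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(self.X[i], x)))
        return self.y[best]

def bagging_predict(X_train, y_train, x, n_estimators=11, seed=0):
    """Train n_estimators copies of the same base classifier on bootstrap
    samples of the training set and combine their outputs by majority vote."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_estimators):
        # Bootstrap sample: draw len(X_train) examples with replacement.
        idx = [rng.randrange(len(X_train)) for _ in range(len(X_train))]
        model = OneNN().fit([X_train[i] for i in idx],
                            [y_train[i] for i in idx])
        votes.append(model.predict(x))
    return Counter(votes).most_common(1)[0][0]

X = [(0.0, 0.0), (0.1, 0.2), (1.0, 1.0), (0.9, 1.1)]
y = ["a", "a", "b", "b"]
print(bagging_predict(X, y, (0.05, 0.1)))  # predicts the label of the nearby "a" cluster
```

Because each bootstrap sample omits some examples, the individual models differ, and the vote tends to smooth out errors that any single model would make on noisy points.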
There are several strategies to use more than one classifier for a single classification
task [36]:
Dynamic classifier selection: This is based on the fact that one classifier may
outperform all the others according to a global performance measure yet not be the
best in every part of the domain. Methods of this type therefore divide the input
domain into several parts and aim to select, in each part, the classifier with the best
performance there.
Multi-stage organization: This builds the classifiers iteratively. At each iteration, a
group of classifiers operates in parallel and their decisions are then combined. A
dynamic selector decides which classifiers are to be activated at each stage, based
on the classification performance of each classifier in previous stages.
Sequential approach: A classifier is used first, and the other ones are used only if
the first does not yield a decision with sufficient confidence.
Parallel approach: All available classifiers are applied to the same input example
in parallel. The outputs of each classifier are then combined to obtain the final
prediction.
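The parallel approach can be sketched as a plurality vote over the outputs of several classifiers applied to the same input. In this illustrative snippet the three rule-based "classifiers" are hypothetical stand-ins for trained models; any fixed combination rule could replace the vote.

```python
from collections import Counter

# Hypothetical stand-ins for trained classifiers: each maps an input
# message to a label. In practice these would be learned models.
classifiers = [
    lambda x: "spam" if "free" in x else "ham",       # keyword rule
    lambda x: "spam" if len(x) > 40 else "ham",       # length rule
    lambda x: "spam" if x.count("!") >= 2 else "ham", # punctuation rule
]

def parallel_combine(x, classifiers):
    # Every classifier sees the same input example (conceptually in parallel);
    # the final prediction is the most common output (plurality vote).
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]

print(parallel_combine("free offer!!", classifiers))  # → spam (2 of 3 votes)
```

The vote is only one possible combination rule; weighted voting or averaging of class probabilities fits the same structure.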
Although the first three approaches have been explored to a certain extent, the
majority of classifier combination research focuses on the fourth approach, due to its
simplicity and the fact that it enables one to take advantage of the factors presented
in the previous section. For these reasons, this topic focuses on the fourth approach.