$R^{*} \leq R \leq 2R^{*}$,
where $R^{*}$ is the optimal (minimal) error rate for the underlying distributions $p_i$, $i = 1, 2, \ldots, M$.
This performance, however, is demonstrated for the training set size tending
to infinity and is therefore not really applicable to real-world problems, in which
we usually have a training set of hundreds or thousands of cases, too few, in any
case, for the number of probability estimates that need to be made.
Further extensions to the k-NN approach can be found in [6, 1, 25, 17]. More
work is needed within the k-NN paradigm to reduce the number of cases in the
training database and thus obtain faster classifications [6, 26].
4 Proposed Approach
In boosting techniques, a distribution or set of weights over the training set is
maintained. On each execution, the weights of incorrectly classified examples are
increased so that the base learner is forced to focus on the hard examples in the
training set. A good description of boosting can be found in [8].
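To illustrate this weight-update mechanism, the following sketch shows a single boosting round; the update rule used is the standard AdaBoost rule, and the estimator interface (a scikit-learn-style fit/predict with sample_weight) is an assumption made here for concreteness, not part of the original description.

import numpy as np

def boosting_round(base_learner, X, y, weights):
    # Fit the base learner on the weighted training set.
    model = base_learner().fit(X, y, sample_weight=weights)
    incorrect = model.predict(X) != y

    # Weighted training error and the corresponding AdaBoost coefficient.
    error = np.sum(weights[incorrect]) / np.sum(weights)
    alpha = 0.5 * np.log((1.0 - error) / max(error, 1e-10))

    # Increase the weights of misclassified examples, decrease the rest,
    # and renormalise so the weights again form a distribution.
    weights = weights * np.exp(np.where(incorrect, alpha, -alpha))
    weights = weights / np.sum(weights)
    return model, weights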
Following the idea of focusing on the hard examples, we wanted to know whether one
algorithm could be used to boost a different one in a simple way. We have chosen
two well-known algorithms, k-NN and ID3, and our approach (in the following
we will refer to it as k-NN-boosting) works as follows:
- Find the instances of the training set that are incorrectly classified by k-NN, classifying each instance with the training set minus that instance (leave-one-out)
- Duplicate the instances incorrectly classified in the previous step
- Apply ID3 to the augmented training set
Let us note that this approach is equivalent to duplicating the weight of the
instances incorrectly classified according to k-NN.
In this manner, the core of this new approach consists of inflating the training
database by adding the cases misclassified by the k-NN algorithm, and then
learning the classification tree from the resulting database. It should be noted
that this approach increases the computational cost only in the model induction
phase, while the classification cost is the same as in the original ID3 paradigm.
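The whole procedure can be sketched as follows. Scikit-learn components are used here only as stand-ins (its DecisionTreeClassifier is CART-based rather than ID3, and the names in the code are illustrative), so this is a sketch of the procedure under those assumptions rather than the authors' implementation.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

def knn_boosting(X, y, k=3):
    n = len(y)
    misclassified = []

    # Step 1: classify every training instance with k-NN learned on the
    # training set minus that instance (leave-one-out).
    for i in range(n):
        mask = np.arange(n) != i
        knn = KNeighborsClassifier(n_neighbors=k).fit(X[mask], y[mask])
        if knn.predict(X[i:i + 1])[0] != y[i]:
            misclassified.append(i)

    # Step 2: duplicate the misclassified instances, which doubles their
    # weight in the training distribution.
    if misclassified:
        X = np.vstack([X, X[misclassified]])
        y = np.concatenate([y, y[misclassified]])

    # Step 3: learn the classification tree from the augmented database
    # (criterion="entropy" mimics ID3's information-gain criterion).
    return DecisionTreeClassifier(criterion="entropy").fit(X, y)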
By modifying the instance distribution in the training dataset, two major effects
can be obtained:
- Selection of a different variable to split on at some node
- A change in the decision about pruning the tree at some point
4.1 Change in the Variable to Split
Let us suppose the training set is formed by twelve cases, six of them belonging
to class A and the remaining six to class B.
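To make the effect concrete, the following small computation shows how duplicating instances can change the information gain of a candidate split; the split counts and the variable name X1 are hypothetical and are not those of figure 4.

import math

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def info_gain(parent, children):
    total = sum(parent)
    remainder = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - remainder

# Original set: 12 cases, 6 of class A and 6 of class B.  Suppose a split on
# some variable X1 produces the children (4 A, 2 B) and (2 A, 4 B).
print(info_gain([6, 6], [[4, 2], [2, 4]]))   # about 0.082 bits

# After duplicating two misclassified class-A cases that fall in the second
# child, the counts become (4 A, 2 B) and (4 A, 4 B); the gain of X1 drops,
# so a different variable may now give the largest gain at this node.
print(info_gain([8, 6], [[4, 2], [4, 4]]))   # about 0.020 bits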
Figure 4 depicts an example of the change in information gain after the
editing of the training set. The numbers in parentheses are of the form (#instances
 