As a synthetic experiment, we draw a dataset from a 22-dimensional two-class problem.
The first two dimensions are drawn from the classical XOR problem, while
the remaining 20 dimensions are drawn from the normal distribution. The first two
dimensions are useful for separating the two classes (i.e. they are informative
dimensions), while the remaining dimensions are noise. The dataset contains 1,000
samples equally distributed over the two classes. In this experiment, as well as in all experiments of
the following subsection, we follow a 10-fold cross-validation procedure: in each
fold, 90% of the samples are used to build the EDBFM matrix and to weight the
original features; the remaining samples are used to evaluate the performance of the
method. In particular, for each fold a weight model is calculated, and an incrementing
number of features, taken in rank order, is used to project the training and test
sets along the most informative features. Hence we first obtain two datasets with the
most important feature, then two datasets with the two most important features,
and so forth until the full-dimensional datasets (i.e. the original ones) are returned.
For each of these pairs of datasets, the Nearest Neighbor algorithm is used to estimate
the accuracy. After the tenth fold, the weights and the accuracies are
averaged by rank, and curves are built that represent the average accuracy the
method achieves over all folds as a function of the number of most informative
features retained.
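As an illustration of the dataset described above, the following Python sketch
generates two XOR dimensions followed by 20 Gaussian noise dimensions. The text does
not specify how the XOR coordinates are sampled, so the uniform magnitudes in (0, 1],
the sign-based XOR labelling, the zero-mean unit-variance noise, and the name
`make_xor_noise_dataset` are all assumptions made for this example.

```python
import numpy as np

def make_xor_noise_dataset(n_samples=1000, n_noise=20, seed=0):
    """Return (X, y): 2 XOR dimensions followed by n_noise Gaussian dimensions.

    Assumes n_samples is even so that both classes get the same number of samples.
    """
    rng = np.random.default_rng(seed)
    # Balanced labels: half of the samples in each class.
    y = np.repeat([0, 1], n_samples // 2)
    # Assumed layout: magnitudes in (0, 1], random sign for the first coordinate.
    mag = 1.0 - rng.uniform(0.0, 1.0, size=(n_samples, 2))
    s1 = rng.choice([-1.0, 1.0], size=n_samples)
    # Class 0: both coordinates share a sign; class 1: the signs differ (XOR).
    s2 = np.where(y == 0, s1, -s1)
    informative = mag * np.column_stack([s1, s2])
    # Noise dimensions: standard normal (mean and variance are assumptions).
    noise = rng.standard_normal((n_samples, n_noise))
    X = np.hstack([informative, noise])
    # Shuffle so that the classes are not stored in two contiguous blocks.
    perm = rng.permutation(n_samples)
    return X[perm], y[perm]

X, y = make_xor_noise_dataset()
print(X.shape, np.bincount(y))  # (1000, 22) [500 500]
```

Sampling the labels first and then choosing the signs keeps the two classes exactly
balanced, which matches the equal class distribution stated in the text.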
The experimental work-flow is depicted in Fig. 4.3a; it consists of two phases:
first the application of the EDBFM-based ranking method to the multivariate dataset
in the filter mode of [26], and then the validation procedure. The process is sketched
in the following pseudocode.
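The two phases can also be sketched in Python. This is a minimal sketch, not the
authors' pseudocode: EDBFM is not implemented here, and a Relief-style weighting
(`relief_weights`) is swapped in purely as a placeholder ranker. All function names,
the fold-splitting details, and the toy data at the end are assumptions.

```python
import numpy as np

def sq_dists(A, B):
    """Squared Euclidean distances between the rows of A and the rows of B."""
    return (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T

def relief_weights(X, y):
    """Placeholder ranker (Relief-style); NOT the EDBFM method of the text."""
    D = sq_dists(X, X)
    np.fill_diagonal(D, np.inf)                 # a sample is not its own neighbour
    same = y[:, None] == y[None, :]
    hit = np.where(same, D, np.inf).argmin(1)   # nearest same-class sample
    miss = np.where(same, np.inf, D).argmin(1)  # nearest other-class sample
    return (np.abs(X - X[miss]) - np.abs(X - X[hit])).mean(0)

def nn_accuracy(X_tr, y_tr, X_te, y_te):
    """Accuracy of the 1-Nearest-Neighbor classifier on the test samples."""
    pred = y_tr[sq_dists(X_te, X_tr).argmin(1)]
    return float((pred == y_te).mean())

def rank_and_validate(X, y, weight_fn=relief_weights, n_folds=10, seed=0):
    """Return (mean weight by rank, mean accuracy by number of features)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    weights, accs = [], []
    for k in range(n_folds):
        te = folds[k]
        tr = np.concatenate([folds[j] for j in range(n_folds) if j != k])
        # Phase 1: weight the features using the training part only.
        w = weight_fn(X[tr], y[tr])
        order = np.argsort(w)[::-1]             # most important feature first
        weights.append(w[order])
        # Phase 2: 1-NN accuracy on nested subsets of the top-ranked features.
        accs.append([nn_accuracy(X[tr][:, order[:m]], y[tr],
                                 X[te][:, order[:m]], y[te])
                     for m in range(1, X.shape[1] + 1)])
    # Average across folds by rank position, not by original feature index.
    return np.mean(weights, axis=0), np.mean(accs, axis=0)

# Toy usage on random data (shapes only; no result is implied).
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 5))
y_demo = rng.integers(0, 2, size=200)
mean_w, mean_acc = rank_and_validate(X_demo, y_demo)
print(mean_w.shape, mean_acc.shape)  # (5,) (5,)
```

Passing the ranker as `weight_fn` keeps the validation loop independent of the
weighting method, so a different feature-weighting function could be substituted
without changing the cross-validation code.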
Fig. 4.3 (a) General work-flow of feature selection techniques. (b) Example of a two-class
classification problem. Piecewise lines represent the approximation of the Bayes boundary found by BVQ.
y1 and y2 represent the two most important extracted features