3.2 Gene Selection
It has been shown that selecting a small subset of informative genes can lead
to improved classification accuracy and greatly reduced execution time of
data mining tools [25]. The most commonly used gene selection approaches
are based on gene ranking. In these approaches, each gene is evaluated
individually and assigned a score representing its correlation with the class.
Genes are then ranked by their scores, and the top-ranked ones are selected
from the initial set of features (genes). To make our experiments less
dependent on the filtering method, we use three different filtering methods.
This way we obtain 12 different microarray datasets with a pre-defined number
of most relevant gene expressions. All three filtering methods are part of the
WEKA toolkit [26] that we used in our experiments. The following filtering
methods were used (a usage sketch follows the list):
GainRatio filter. This is the heuristic that was originally used by Quinlan
in ID3 [27]. It is implemented in WEKA as a simple and fast feature selection
method. The idea of using this feature selection technique for gene ranking
was already presented by Ben-Dor et al. [28].
Relief-F filter. The basic idea of the Relief-F algorithm [29] is to draw
instances at random, compute their nearest neighbors, and adjust a feature
weighting vector to give more weight to features that discriminate an instance
from neighbors of different classes (see the sketch after this list). A study
comparing Relief-F to other similar methods in the microarray classification
domain was conducted by Wang and Makedon [30], who conclude that the
performance of Relief-F is comparable with that of other methods.
SVM filter. Ranking is done using a Support Vector Machine (SVM) classifier.
A similar approach using an SVM classifier for gene selection was already used
in the papers by Guyon et al. [31] and Fujarewicz et al. [32].
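To make the ranking procedure concrete, the following is a minimal sketch of how such a filter can be applied through the WEKA attribute selection API. The dataset file name and the number of retained genes are hypothetical, and the GainRatio evaluator can be swapped for ReliefFAttributeEval or SVMAttributeEval to obtain the other two rankings.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GeneRankingSketch {
    public static void main(String[] args) throws Exception {
        // Load a microarray dataset; the file name is hypothetical.
        Instances data = DataSource.read("microarray.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Score each gene individually, rank by score, keep the top ones.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new GainRatioAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100); // assumed pre-defined number of genes
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Keep only the selected genes (plus the class attribute).
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Retained attributes: " + reduced.numAttributes());
    }
}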
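The weight-update step behind Relief-F can also be written out directly. The sketch below implements only the basic two-class Relief (one nearest hit and one nearest miss per random draw), not the full Relief-F extension, which averages over k neighbors and handles multiple classes and missing values; feature values are assumed to be scaled to [0, 1], and each class is assumed to contain at least two instances.

import java.util.Random;

public class ReliefSketch {

    // Returns a weight per feature; higher weights mean better class discrimination.
    static double[] reliefWeights(double[][] x, int[] y, int sampleSize, long seed) {
        int m = x[0].length;            // number of features (genes)
        double[] w = new double[m];     // feature weighting vector
        Random rnd = new Random(seed);

        for (int s = 0; s < sampleSize; s++) {
            int i = rnd.nextInt(x.length);      // draw an instance at random
            int hit = nearest(x, y, i, true);   // nearest neighbor of the same class
            int miss = nearest(x, y, i, false); // nearest neighbor of the other class
            for (int f = 0; f < m; f++) {
                // Reward features that differ across classes,
                // penalize features that differ within a class.
                w[f] += (Math.abs(x[i][f] - x[miss][f])
                       - Math.abs(x[i][f] - x[hit][f])) / sampleSize;
            }
        }
        return w;
    }

    // Index of the nearest neighbor of instance i, restricted to the same
    // class or to a different class, by squared Euclidean distance.
    static int nearest(double[][] x, int[] y, int i, boolean sameClass) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < x.length; j++) {
            if (j == i || (y[j] == y[i]) != sameClass) continue;
            double d = 0.0;
            for (int f = 0; f < x[i].length; f++) {
                double diff = x[i][f] - x[j][f];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }
}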
3.3 Experiment Setting
The experiments are designed to test the accuracy gain of all three CMMC
methods compared to the accuracy of a single J48 tree (the Java implementation
of the C4.5 tree in the WEKA toolkit). The study followed an n-fold
cross-validation process for testing. The n-fold cross-validation is typically
implemented by running the same learning system n times, each time on a
different training set of size (n-1)/n times the size of the original data
set. A specific variation of n-fold cross-validation, called the leave-one-out
cross-validation method (LOOCV), is used in the experiment. In this approach,
one sample of the training set is withheld, the remaining samples are used
to build a classifier to predict the class of the withheld sample, and the
cumulative error over all withheld samples is then calculated. LOOCV has often
been criticized because of its higher error variance in comparison to five-
or ten-fold cross-validation [33], but a recent study by Braga-Neto and
Dougherty [34] shows that LOOCV can be considered very useful for microarray
datasets.
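For illustration, LOOCV over a J48 tree can be run in WEKA by setting the number of cross-validation folds equal to the number of instances; a minimal sketch with a hypothetical dataset file follows.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoocvSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("microarray_top100.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // LOOCV is n-fold cross-validation with n = number of instances:
        // each sample is withheld once while the rest train the tree.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, data.numInstances(), new Random(1));

        System.out.printf("LOOCV accuracy: %.2f%%%n", eval.pctCorrect());
    }
}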