3.2 Gene Selection
It has been shown that selecting a small subset of informative genes can lead
to improved classification accuracy and greatly reduced execution time of
data mining tools [25]. The most commonly used gene selection approaches
are based on gene ranking. In these approaches, each gene is evaluated
individually and assigned a score representing its correlation with the class.
Genes are then ranked by their scores, and the top-ranked ones are selected
from the initial set of features (genes). To make our experiments less
dependent on the filtering method, we use three different filtering methods.
This way we obtain 12 different microarray datasets with a pre-defined number
of most relevant gene expressions. All three filtering methods are part of the
WEKA toolkit [26] that we used in our experiments. The following filtering
methods were used (a usage sketch follows the list):
GainRatio filter. This is the heuristic that was originally used by Quinlan
in ID3 [27]. It is implemented in WEKA as a simple and fast feature selection
method. The idea of using this feature selection technique for gene ranking
was already presented by Ben-Dor et al. [28].
Relief-F filter. The basic idea of the Relief-F algorithm [29] is to draw
instances at random, compute their nearest neighbors, and adjust a feature
weighting vector to give more weight to features that discriminate an instance
from neighbors of different classes (see the sketch after this list). A study
comparing Relief-F to other similar methods in the microarray classification
domain was conducted by Wang and Makedon [30], who conclude that the
performance of Relief-F is comparable with that of other methods.
SVM filter. Ranking is done using a Support Vector Machine (SVM) classifier.
A similar approach using an SVM classifier for gene selection was already used
in the papers by Guyon et al. [31] and Fujarewicz et al. [32].
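To make the ranking procedure concrete, the following is a minimal sketch of how such a filter can be applied through the WEKA attribute selection API. The dataset file name and the number of retained genes are hypothetical, and the GainRatio evaluator can be swapped for ReliefFAttributeEval or SVMAttributeEval to obtain the other two rankings.

import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.GainRatioAttributeEval;
import weka.attributeSelection.Ranker;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class GeneRankingSketch {
    public static void main(String[] args) throws Exception {
        // Load a microarray dataset; the file name is hypothetical.
        Instances data = DataSource.read("microarray.arff");
        data.setClassIndex(data.numAttributes() - 1);

        // Score each gene individually, rank by score, keep the top ones.
        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new GainRatioAttributeEval());
        Ranker ranker = new Ranker();
        ranker.setNumToSelect(100); // assumed pre-defined number of genes
        selector.setSearch(ranker);
        selector.SelectAttributes(data);

        // Keep only the selected genes (plus the class attribute).
        Instances reduced = selector.reduceDimensionality(data);
        System.out.println("Retained attributes: " + reduced.numAttributes());
    }
}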
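The weight-update step behind Relief-F can also be written out directly. The sketch below implements only the basic two-class Relief (one nearest hit and one nearest miss per random draw), not the full Relief-F extension, which averages over k neighbors and handles multiple classes and missing values; feature values are assumed to be scaled to [0, 1], and each class is assumed to contain at least two instances.

import java.util.Random;

public class ReliefSketch {

    // Returns a weight per feature; higher weights mean better class discrimination.
    static double[] reliefWeights(double[][] x, int[] y, int sampleSize, long seed) {
        int m = x[0].length;            // number of features (genes)
        double[] w = new double[m];     // feature weighting vector
        Random rnd = new Random(seed);

        for (int s = 0; s < sampleSize; s++) {
            int i = rnd.nextInt(x.length);      // draw an instance at random
            int hit = nearest(x, y, i, true);   // nearest neighbor of the same class
            int miss = nearest(x, y, i, false); // nearest neighbor of the other class
            for (int f = 0; f < m; f++) {
                // Reward features that differ across classes,
                // penalize features that differ within a class.
                w[f] += (Math.abs(x[i][f] - x[miss][f])
                       - Math.abs(x[i][f] - x[hit][f])) / sampleSize;
            }
        }
        return w;
    }

    // Index of the nearest neighbor of instance i, restricted to the same
    // class or to a different class, by squared Euclidean distance.
    static int nearest(double[][] x, int[] y, int i, boolean sameClass) {
        int best = -1;
        double bestDist = Double.MAX_VALUE;
        for (int j = 0; j < x.length; j++) {
            if (j == i || (y[j] == y[i]) != sameClass) continue;
            double d = 0.0;
            for (int f = 0; f < x[i].length; f++) {
                double diff = x[i][f] - x[j][f];
                d += diff * diff;
            }
            if (d < bestDist) { bestDist = d; best = j; }
        }
        return best;
    }
}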
3.3 Experiment Setting
The experiments are designed to test the accuracy gain of all three CMMC
methods compared to the accuracy of a single J48 tree (the Java implementation
of the C4.5 tree in the WEKA toolkit). The study followed an n-fold
cross-validation process for testing. The n-fold cross-validation is typically
implemented by running the same learning system n times, each time on a
different training set of size (n-1)/n times the size of the original data
set. A specific variation of n-fold cross-validation, called the leave-one-out
cross-validation method (LOOCV), is used in the experiment. In this approach,
one sample of the training set is withheld, the remaining samples are used
to build a classifier to predict the class of the withheld sample, and the
cumulative error over all withheld samples is then calculated. LOOCV has often
been criticized because of its higher error variance in comparison to five-
or ten-fold cross-validation [33], but a recent study by Braga-Neto and
Dougherty [34] shows that LOOCV can be considered very useful for microarray
datasets.
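For illustration, LOOCV over a J48 tree can be run in WEKA by setting the number of cross-validation folds equal to the number of instances; a minimal sketch with a hypothetical dataset file follows.

import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class LoocvSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("microarray_top100.arff"); // hypothetical file
        data.setClassIndex(data.numAttributes() - 1);

        // LOOCV is n-fold cross-validation with n = number of instances:
        // each sample is withheld once while the rest train the tree.
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, data.numInstances(), new Random(1));

        System.out.printf("LOOCV accuracy: %.2f%%%n", eval.pctCorrect());
    }
}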