for decision tree building are examined, which will be referred to as CMMC-1, CMMC-2 and CMMC-3. The first two methods are based on multiplication of the original training set instances: new data points are generated by creating copies of original training set instances and slightly changing their attribute values. The first method is based on the variance of the gene expression values; each attribute can be changed by adding to the original gene expression value a random value drawn from one of the intervals [−3σ, −σ] and [σ, 3σ]. Because of the large number of attributes, only a randomly selected 50% of the attributes are changed. The result of such data point multiplication is a wide dispersion of the points around their base data point, while the original distribution of the training set samples is still preserved.
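As a minimal sketch of this multiplication step (not necessarily the implementation used in the experiments), the perturbation can be written as follows; the function name, the number of copies generated per instance, and the estimation of σ per attribute from the training set are illustrative assumptions.

```python
import numpy as np

def cmmc1_multiply(X, n_copies=10, fraction=0.5, rng=None):
    """CMMC-1-style sketch: perturb a random 50% of the attributes.

    Each selected attribute is shifted by a value drawn from
    [-3*sigma, -sigma] or [sigma, 3*sigma], where sigma is the
    per-attribute standard deviation estimated from the training set.
    """
    rng = np.random.default_rng(rng)
    sigma = X.std(axis=0)                      # per-attribute spread estimate
    n, k = X.shape
    copies = []
    for x in X:
        for _ in range(n_copies):
            new_x = x.copy()
            chosen = rng.choice(k, size=int(fraction * k), replace=False)
            magnitude = rng.uniform(sigma[chosen], 3 * sigma[chosen])
            sign = rng.choice([-1.0, 1.0], size=chosen.size)
            new_x[chosen] += sign * magnitude  # shift only the selected attributes
            copies.append(new_x)
    return np.array(copies)
```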
The second method tries to maintain the original distribution within an even tighter area than the first one, especially when data points lie close together. This is done by generating random points in the interval [x − d, x + d], where x is the value of the attribute and d is the distance to the nearest neighbouring value of this attribute. Again, only 50% of the attributes are randomly selected for modification.
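A corresponding sketch for the second variant is given below; how the nearest-neighbour distance d is computed for each attribute of a given instance is again our assumption.

```python
import numpy as np

def cmmc2_multiply(X, n_copies=10, fraction=0.5, rng=None):
    """CMMC-2-style sketch: perturb within the nearest-neighbour distance.

    Each selected attribute value x is replaced by a value drawn uniformly
    from [x - d, x + d], where d is the distance from x to the closest value
    of that attribute among the other training instances.
    """
    rng = np.random.default_rng(rng)
    n, k = X.shape
    copies = []
    for i, x in enumerate(X):
        others = np.delete(X, i, axis=0)
        d = np.abs(others - x).min(axis=0)     # per-attribute nearest-neighbour distance
        for _ in range(n_copies):
            new_x = x.copy()
            chosen = rng.choice(k, size=int(fraction * k), replace=False)
            new_x[chosen] = rng.uniform(x[chosen] - d[chosen], x[chosen] + d[chosen])
            copies.append(new_x)
    return np.array(copies)
```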
Another modification of the original approach was the application of a different ensemble building method. Based on our own tests, and also on reports in some papers [16], we decided to use the Random Forest ensemble building method, which is based on one of the first ensemble building methods, called bagging [17]. To compose an ensemble of base classifiers using bagging, each classifier is trained on a set of n training examples drawn randomly with replacement from the original training set of size m. Such a subset of examples is also called a bootstrap replicate of the original set.
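Drawing a bootstrap replicate can be sketched in a few lines; using n = m (a replicate of the same size as the original set) is the usual convention and an assumption here.

```python
import numpy as np

def bootstrap_replicate(X, y, n=None, rng=None):
    """Draw a bootstrap replicate: n examples sampled with replacement."""
    rng = np.random.default_rng(rng)
    m = len(X)
    idx = rng.integers(0, m, size=n if n is not None else m)
    return X[idx], y[idx]
```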
Breiman upgraded the idea of bagging by combining it with random feature selection for decision trees. In this way he created Random Forests, where each member of the ensemble is trained on a bootstrap replicate, as in bagging. Decision trees are then grown by selecting the feature to split on at each node from a randomly chosen subset of features. We set the number of chosen features to log₂(k + 1), as in [18], where k is the total number of features.
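In a present-day library this configuration corresponds roughly to the following scikit-learn call; the placeholder data, the ensemble size of 100 trees, and flooring log₂(k + 1) to an integer are illustrative assumptions.

```python
from math import floor, log2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train stand for the gene expression training set (placeholder data here).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 2000))          # 40 samples, 2000 genes
y_train = rng.integers(0, 2, size=40)

k = X_train.shape[1]                           # total number of features
n_split_features = max(1, floor(log2(k + 1)))  # features considered at each split

forest = RandomForestClassifier(
    n_estimators=100,        # ensemble size is an assumption, not stated in the text
    max_features=n_split_features,
    bootstrap=True,          # each tree is trained on a bootstrap replicate
    random_state=0,
)
forest.fit(X_train, y_train)
```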
Random Forests are an ensemble method that works well even with noisy content in the training dataset and are considered one of the most competitive methods, comparable to boosting [19].
To get the most out of the proposed multiplication of data points, another version of the CMMC algorithm was derived. This version (CMMC-3) is based on multiplication of data points in each of the leaves, which are later extended by an additional subtree. Since our decision trees are pruned, they achieve good generalization and are less complex than unpruned decision trees. We therefore try to “upgrade” each leaf by attaching another subtree under it. These subtrees are built using the CMMC technique described above (essentially the same as CMMC-2). In this way the existing data points that reached the leaf are multiplied (again by adding 1,000 artificial data points labelled by the Random Forest), and the problem of a small number of samples in the lower nodes of the trees is reduced, though not solved, as we cannot be certain that the artificial samples are labelled correctly.
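The leaf-extension step can be sketched as follows, reusing the cmmc2_multiply function from above; the depth limit of the attached subtree, the use of scikit-learn trees, and the way the 1,000 artificial points are split across the leaf's instances are illustrative assumptions, not details given in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extend_leaf(X_leaf, y_leaf, forest, n_artificial=1000, rng=None):
    """CMMC-3-style sketch: grow a subtree under a leaf on multiplied data.

    X_leaf, y_leaf are the original training instances that reached the leaf;
    forest is a fitted Random Forest used to label the artificial points.
    """
    # Multiply the leaf's data with the CMMC-2-style perturbation (sketch above),
    # creating roughly n_artificial new points in total.
    copies_per_instance = max(1, n_artificial // len(X_leaf))
    X_art = cmmc2_multiply(X_leaf, n_copies=copies_per_instance, rng=rng)
    y_art = forest.predict(X_art)              # label artificial points with the forest

    # Fit a small subtree on original + artificial points; the depth limit is an assumption.
    X_all = np.vstack([X_leaf, X_art])
    y_all = np.concatenate([y_leaf, y_art])
    return DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_all, y_all)
```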