for decision tree building are examined, which will be referred to as CMMC-1, CMMC-2 and CMMC-3. The first two methods are based on multiplication of the original training set instances: new data points are generated by creating copies of original training set instances and slightly changing their attribute values. The first method is based on the variance of the gene expression values; each attribute can be changed by adding to the original gene expression value a random value drawn from one of the intervals [−3σ, −σ] and [σ, 3σ]. Because of the large number of attributes, only a randomly selected 50% of the attributes are changed. The result of such data point multiplication is a wide dispersion of the points around their base data point, while the original distribution of the training set samples is still preserved.
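As a minimal sketch of this multiplication step (not necessarily the implementation used in the experiments), the perturbation can be written as follows; the function name, the number of copies generated per instance, and the estimation of σ per attribute from the training set are illustrative assumptions.

```python
import numpy as np

def cmmc1_multiply(X, n_copies=10, fraction=0.5, rng=None):
    """CMMC-1-style sketch: perturb a random 50% of the attributes.

    Each selected attribute is shifted by a value drawn from
    [-3*sigma, -sigma] or [sigma, 3*sigma], where sigma is the
    per-attribute standard deviation estimated from the training set.
    """
    rng = np.random.default_rng(rng)
    sigma = X.std(axis=0)                      # per-attribute spread estimate
    n, k = X.shape
    copies = []
    for x in X:
        for _ in range(n_copies):
            new_x = x.copy()
            chosen = rng.choice(k, size=int(fraction * k), replace=False)
            magnitude = rng.uniform(sigma[chosen], 3 * sigma[chosen])
            sign = rng.choice([-1.0, 1.0], size=chosen.size)
            new_x[chosen] += sign * magnitude  # shift only the selected attributes
            copies.append(new_x)
    return np.array(copies)
```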
The second method tries to maintain the original distribution within an even tighter area than the first one, especially when data points lie close together. This is done by generating random points in the interval [x − d, x + d], where x is the value of the attribute and d is the distance to the nearest neighbouring value of this attribute. Again, only 50% of the attributes are randomly selected for modification.
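A corresponding sketch for the second variant is given below; how the nearest-neighbour distance d is computed for each attribute of a given instance is again our assumption.

```python
import numpy as np

def cmmc2_multiply(X, n_copies=10, fraction=0.5, rng=None):
    """CMMC-2-style sketch: perturb within the nearest-neighbour distance.

    Each selected attribute value x is replaced by a value drawn uniformly
    from [x - d, x + d], where d is the distance from x to the closest value
    of that attribute among the other training instances.
    """
    rng = np.random.default_rng(rng)
    n, k = X.shape
    copies = []
    for i, x in enumerate(X):
        others = np.delete(X, i, axis=0)
        d = np.abs(others - x).min(axis=0)     # per-attribute nearest-neighbour distance
        for _ in range(n_copies):
            new_x = x.copy()
            chosen = rng.choice(k, size=int(fraction * k), replace=False)
            new_x[chosen] = rng.uniform(x[chosen] - d[chosen], x[chosen] + d[chosen])
            copies.append(new_x)
    return np.array(copies)
```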
Another modification of the original approach was the application of a different ensemble building method. Based on our own tests, and also on reports in some papers [16], we decided to use the Random Forest ensemble building method, which is based on one of the first ensemble building methods, called bagging [17]. To compose an ensemble of base classifiers using bagging, each classifier is trained on a set of n training examples drawn randomly with replacement from the original training set of size m. Such a subset of examples is also called a bootstrap replicate of the original set.
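Drawing a bootstrap replicate can be sketched in a few lines; using n = m (a replicate of the same size as the original set) is the usual convention and an assumption here.

```python
import numpy as np

def bootstrap_replicate(X, y, n=None, rng=None):
    """Draw a bootstrap replicate: n examples sampled with replacement."""
    rng = np.random.default_rng(rng)
    m = len(X)
    idx = rng.integers(0, m, size=n if n is not None else m)
    return X[idx], y[idx]
```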
Breiman upgraded the idea of bagging by combining it with random feature selection for decision trees. In this way he created Random Forests, where each member of the ensemble is trained on a bootstrap replicate, as in bagging. Decision trees are then grown by selecting the feature to split on at each node from a randomly chosen subset of features. We set the number of chosen features to log₂(k + 1), as in [18], where k is the total number of features.
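In a present-day library this configuration corresponds roughly to the following scikit-learn call; the placeholder data, the ensemble size of 100 trees, and flooring log₂(k + 1) to an integer are illustrative assumptions.

```python
from math import floor, log2
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# X_train, y_train stand for the gene expression training set (placeholder data here).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(40, 2000))          # 40 samples, 2000 genes
y_train = rng.integers(0, 2, size=40)

k = X_train.shape[1]                           # total number of features
n_split_features = max(1, floor(log2(k + 1)))  # features considered at each split

forest = RandomForestClassifier(
    n_estimators=100,        # ensemble size is an assumption, not stated in the text
    max_features=n_split_features,
    bootstrap=True,          # each tree is trained on a bootstrap replicate
    random_state=0,
)
forest.fit(X_train, y_train)
```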
Random Forests are an ensemble method that works well even with noisy content in the training dataset and are considered one of the most competitive methods, comparable to boosting [19].
To get the most out of the proposed multiplication of data points, another version of the CMMC algorithm was derived. This version (CMMC-3) is based on multiplication of data points in each of the leaves, which are later extended by an additional subtree. Since our decision trees are pruned, they achieve good generalization and are less complex than unpruned decision trees. We therefore try to “upgrade” each leaf by attaching another subtree under it. These subtrees are built using the CMMC technique described above (essentially the same as CMMC-2). In this way the existing data points that reached the leaf are multiplied (again by adding 1,000 artificial data points labelled by the Random Forest), and the problem of a small number of samples in the lower nodes of the trees is reduced, though not solved, as we cannot be certain that the artificial samples are labelled correctly.
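The leaf-extension step can be sketched as follows, reusing the cmmc2_multiply function from above; the depth limit of the attached subtree, the use of scikit-learn trees, and the way the 1,000 artificial points are split across the leaf's instances are illustrative assumptions, not details given in the text.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def extend_leaf(X_leaf, y_leaf, forest, n_artificial=1000, rng=None):
    """CMMC-3-style sketch: grow a subtree under a leaf on multiplied data.

    X_leaf, y_leaf are the original training instances that reached the leaf;
    forest is a fitted Random Forest used to label the artificial points.
    """
    # Multiply the leaf's data with the CMMC-2-style perturbation (sketch above),
    # creating roughly n_artificial new points in total.
    copies_per_instance = max(1, n_artificial // len(X_leaf))
    X_art = cmmc2_multiply(X_leaf, n_copies=copies_per_instance, rng=rng)
    y_art = forest.predict(X_art)              # label artificial points with the forest

    # Fit a small subtree on original + artificial points; the depth limit is an assumption.
    X_all = np.vstack([X_leaf, X_art])
    y_all = np.concatenate([y_leaf, y_art])
    return DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_all, y_all)
```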