part of the training process. Therefore, the performance of the algorithm should be
reported as that achieved on the completely unseen test set.
If there are insufficient data to establish three separate datasets, the next-most-parsimonious approach is cross-validation. With cross-validation the data is divided into a number of equally sized subsets, or folds, often either three or ten, depending upon the amount of data available. One fold is held out, and the algorithm is trained on the rest. The held-out data is then classified using the trained classifier. This process is repeated with each fold held out in turn. The end result of this approach is that, in every case, the data is classified by a classifier on which it was not trained. The trade-off is that, since the classifiers are trained upon smaller subsets of the data, they are likely to perform less well than a classifier trained upon the entire dataset. The ultimate form of cross-validation is the "leave-one-out cross-validation" approach, in which each individual case is held out in turn (Witten et al., 2011).
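The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: it assumes a hypothetical one-dimensional nearest-centroid classifier as the algorithm being evaluated, and implements the fold splitting by hand.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_predict(train_X, train_y, x):
    """Classify x by the class whose training-set mean is closest (1-D features)."""
    centroids = {}
    for label in set(train_y):
        vals = [train_X[i] for i in range(len(train_y)) if train_y[i] == label]
        centroids[label] = sum(vals) / len(vals)
    return min(centroids, key=lambda lbl: abs(x - centroids[lbl]))

def cross_validate(X, y, k=10):
    """Hold out each fold in turn, train on the rest, and return overall accuracy.

    Every case is classified exactly once, by a classifier that never saw it.
    """
    folds = k_fold_indices(len(X), k)
    correct = 0
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held_out_set]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in held_out:
            if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
                correct += 1
    return correct / len(X)
```

Setting `k` equal to the number of cases gives leave-one-out cross-validation: each fold then contains a single case.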
8.3 The peaking phenomenon
It appears intuitively obvious that providing more descriptive variables to a data mining algorithm will improve its performance, and to some extent this assumption is valid. However, all data contain noise as well as signal, particularly data generated by high-throughput approaches. Eventually, the addition of new variables will actually degrade rather than enhance an algorithm's performance, a scenario known as the peaking phenomenon (Figure 2.19) (Sima and Dougherty, 2008). The peaking phenomenon does not always occur, but it should be tested for by running the algorithm on different-sized subsets of the available variables, to check the effect of the number of variables included in the analysis on the accuracy of the output.
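The suggested check can be sketched as follows. This is an illustrative experiment on synthetic data, not a prescribed protocol: it assumes a dataset in which only the first variable is informative and the rest are pure noise, and it uses a simple nearest-centroid classifier to trace accuracy as variables are added.

```python
import math
import random

def make_data(n, n_noise, seed=0):
    """Two classes separated on the first variable; all other variables are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        informative = label * 3.0 + rng.gauss(0, 1)
        X.append([informative] + [rng.gauss(0, 1) for _ in range(n_noise)])
        y.append(label)
    return X, y

def centroid_accuracy(X, y, n_vars):
    """Train/test split; nearest-centroid accuracy using only the first n_vars variables."""
    half = len(X) // 2
    Xt = [row[:n_vars] for row in X]
    train_X, train_y = Xt[:half], y[:half]
    test_X, test_y = Xt[half:], y[half:]
    centroids = {}
    for label in set(train_y):
        rows = [train_X[i] for i in range(half) if train_y[i] == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = sum(
        1 for xi, yi in zip(test_X, test_y)
        if min(centroids, key=lambda l: math.dist(xi, centroids[l])) == yi
    )
    return correct / len(test_X)

# Sweep the number of variables and record the accuracy at each size;
# plotting this curve reveals whether peaking occurs for this dataset.
X, y = make_data(n=60, n_noise=39)
curve = [centroid_accuracy(X, y, p) for p in range(1, 41)]
```

Plotting `curve` against the number of variables reproduces the kind of curve shown in Figure 2.19 when peaking is present: accuracy rises at first, then falls as the added variables contribute more noise than signal.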
[Figure 2.19: line plot of classification accuracy (y-axis, 0 to 1) against the number of variables (x-axis, 0 to 40).]
FIGURE 2.19
The peaking phenomenon. As variables are added to an analysis, the accuracy of the
classification initially rises. Eventually, the addition of more variables introduces more noise
than signal, and the performance of the algorithm deteriorates.