part of the training process. Therefore, the performance of the algorithm should be
reported as that achieved on the completely unseen test set.
If there are insufficient data to establish three separate datasets, the next-most-parsimonious approach is cross-validation. With cross-validation the data is divided into a number of equally sized subsets, or folds, often either three or ten, depending upon the amount of data available. One fold is held out, and the algorithm is trained on the rest. The held-out data is then classified using the trained classifier. This process is repeated with each fold held out in turn. The end result of this approach is that, in every case, the data is classified by a classifier on which it was not trained. The trade-off is that, since the classifiers are trained upon smaller subsets of the data, they are likely to perform less well than a classifier trained upon the entire dataset. The ultimate form of cross-validation is the "leave-one-out cross-validation" approach, in which each individual case is held out in turn (Witten et al., 2011).
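The procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the method of any particular package: it assumes a hypothetical one-dimensional nearest-centroid classifier as the algorithm being evaluated, and implements the fold splitting by hand.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Shuffle indices 0..n-1 and split them into k roughly equal folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def nearest_centroid_predict(train_X, train_y, x):
    """Classify x by the class whose training-set mean is closest (1-D features)."""
    centroids = {}
    for label in set(train_y):
        vals = [train_X[i] for i in range(len(train_y)) if train_y[i] == label]
        centroids[label] = sum(vals) / len(vals)
    return min(centroids, key=lambda lbl: abs(x - centroids[lbl]))

def cross_validate(X, y, k=10):
    """Hold out each fold in turn, train on the rest, and return overall accuracy.

    Every case is classified exactly once, by a classifier that never saw it.
    """
    folds = k_fold_indices(len(X), k)
    correct = 0
    for held_out in folds:
        held_out_set = set(held_out)
        train_idx = [i for i in range(len(X)) if i not in held_out_set]
        train_X = [X[i] for i in train_idx]
        train_y = [y[i] for i in train_idx]
        for i in held_out:
            if nearest_centroid_predict(train_X, train_y, X[i]) == y[i]:
                correct += 1
    return correct / len(X)
```

Setting `k` equal to the number of cases gives leave-one-out cross-validation: each fold then contains a single case.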
8.3 The peaking phenomenon
It appears intuitively obvious that providing more descriptive variables to a data mining algorithm will improve its performance, and to some extent this assumption is valid. However, all data contain noise as well as signal, particularly data generated by high-throughput approaches. Eventually, the addition of new variables will actually degrade rather than enhance an algorithm's performance, a scenario known as the peaking phenomenon (Figure 2.19) (Sima and Dougherty, 2008). The peaking phenomenon does not always occur, but it should be tested for by running the algorithm on different-sized subsets of the available variables, to check the effect of the number of variables included in the analysis on the accuracy of the output.
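The suggested check can be sketched as follows. This is an illustrative experiment on synthetic data, not a prescribed protocol: it assumes a dataset in which only the first variable is informative and the rest are pure noise, and it uses a simple nearest-centroid classifier to trace accuracy as variables are added.

```python
import math
import random

def make_data(n, n_noise, seed=0):
    """Two classes separated on the first variable; all other variables are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        informative = label * 3.0 + rng.gauss(0, 1)
        X.append([informative] + [rng.gauss(0, 1) for _ in range(n_noise)])
        y.append(label)
    return X, y

def centroid_accuracy(X, y, n_vars):
    """Train/test split; nearest-centroid accuracy using only the first n_vars variables."""
    half = len(X) // 2
    Xt = [row[:n_vars] for row in X]
    train_X, train_y = Xt[:half], y[:half]
    test_X, test_y = Xt[half:], y[half:]
    centroids = {}
    for label in set(train_y):
        rows = [train_X[i] for i in range(half) if train_y[i] == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    correct = sum(
        1 for xi, yi in zip(test_X, test_y)
        if min(centroids, key=lambda l: math.dist(xi, centroids[l])) == yi
    )
    return correct / len(test_X)

# Sweep the number of variables and record the accuracy at each size;
# plotting this curve reveals whether peaking occurs for this dataset.
X, y = make_data(n=60, n_noise=39)
curve = [centroid_accuracy(X, y, p) for p in range(1, 41)]
```

Plotting `curve` against the number of variables reproduces the kind of curve shown in Figure 2.19 when peaking is present: accuracy rises at first, then falls as the added variables contribute more noise than signal.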
[Figure 2.19: line plot of classification accuracy (y-axis, 0 to 1) against the number of variables (x-axis, 0 to 40).]
FIGURE 2.19
The peaking phenomenon. As variables are added to an analysis, the accuracy of the
classification initially rises. Eventually, the addition of more variables introduces more noise
than signal, and the performance of the algorithm deteriorates.