algorithms are affected by the presence of noise in a dataset, and these effects must be
borne in mind during the analysis.
8.2 Overfitting
One of the issues that often arises is overfitting. Overfitting occurs when an algorithm
that is trained using labelled data learns the characteristics of the training data too well,
to the point where its performance upon previously unseen data deteriorates.
Many CI-oriented data mining algorithms are prone to overfitting. All datasets
contain both signals—the patterns in the data that are important—and noise—errors
due to random chance or variation in equipment performance. The aim of a data mining
algorithm is to learn the characteristics of the signal. However, many algorithms
will learn the noise as well, particularly if the training dataset is small.
Noise, due to its random nature, will be unique to a particular training set, and
overfitting thus leads to a decrease in the performance of the algorithm on unseen
data (Figure 2.18). Noise can be reduced, for example, by using multiple forms of
measurement, but can never be completely eliminated.
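The effect of learning the noise can be illustrated with a small sketch. Everything in it is an assumption made for illustration and is not taken from the text: the synthetic one-dimensional data, the 20% rate of flipped labels, and the use of a k-nearest-neighbour classifier as the learner. With k = 1 the classifier simply memorises the training set, noise included; with k = 15 it averages over neighbours and is less affected by individual noisy labels.

```python
import random


def make_data(n, noise_rate, rng):
    """Return n (x, label) pairs. The signal is 'x > 0.5'; noise flips some labels."""
    data = []
    for _ in range(n):
        x = rng.random()
        label = x > 0.5
        if rng.random() < noise_rate:
            label = not label  # random labelling error (noise)
        data.append((x, label))
    return data


def knn_predict(train, x, k):
    """Majority vote among the k training points closest to x (k should be odd)."""
    nearest = sorted(train, key=lambda point: abs(point[0] - x))[:k]
    votes = sum(1 for _, label in nearest if label)
    return votes * 2 > k


def accuracy(train, data, k):
    """Fraction of records in data whose label is predicted correctly."""
    correct = sum(1 for x, label in data if knn_predict(train, x, k) == label)
    return correct / len(data)


rng = random.Random(0)
train = make_data(300, 0.2, rng)   # small, noisy training set
test = make_data(2000, 0.2, rng)   # unseen data with its own, independent noise

results = {k: (accuracy(train, train, k), accuracy(train, test, k)) for k in (1, 15)}
for k, (train_acc, test_acc) in results.items():
    print(f"k={k:2d}  training accuracy={train_acc:.2f}  test accuracy={test_acc:.2f}")
```

In this sketch the k = 1 model scores perfectly on its own training data, because each training point is its own nearest neighbour, yet it does worse on the unseen data than the k = 15 model, which is the pattern the text describes.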
To avoid overfitting it is important to make the most effective use of the available
data. If enough data are available, the ideal situation is to have three completely separate
datasets: training, validation and test. The training dataset is used, on its own, to
train the algorithm. The performance of the trained algorithm is then assessed using
the previously unseen validation dataset. If the performance of the algorithm is not
adequate, changes can be made to the training set, and the algorithm re-trained and
re-validated. By using the validation set in this way, however, it essentially becomes
[Figure 2.18 plot: horizontal axis "Size of tree (number of nodes)", 0 to 100; vertical axis 0.5 to 1; separate curves for test data and training data]
FIGURE 2.18
Overfitting leads to decreased performance of an algorithm on unseen data. In this case
performance is recorded for a set of decision trees with varying numbers of nodes.
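The division of the available data into completely separate training, validation and test datasets, as described in the text, might be sketched as follows. The function name `three_way_split`, the 60/20/20 proportions and the fixed random seed are assumptions chosen for illustration; the text does not prescribe any particular proportions.

```python
import random


def three_way_split(records, train_frac=0.6, val_frac=0.2, seed=0):
    """Shuffle records and split them into disjoint training, validation and test lists.

    The fractions are illustrative defaults; whatever remains after the
    training and validation portions becomes the test set.
    """
    if train_frac <= 0 or val_frac <= 0 or train_frac + val_frac >= 1:
        raise ValueError("fractions must be positive and leave room for a test set")
    shuffled = list(records)               # copy, so the caller's data is untouched
    random.Random(seed).shuffle(shuffled)  # seeded shuffle for a repeatable split
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, validation, test


records = list(range(100))  # stand-in for 100 labelled records
train, validation, test = three_way_split(records)
print(len(train), len(validation), len(test))  # 60 20 20
```

Because each record ends up in exactly one of the three lists, the validation and test data remain unseen during training.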