issue of evaluating the performance of the considered pattern classification module. Indeed, it is very common to obtain overestimated performance, in most cases due to overfitting. Overfitting is a critical issue in model validation and refers to the lack of generalization of the classifier: the classifier has, in some way, adapted too closely to the specific dataset used to train the model and, when presented with unknown samples, it fails or classifies poorly. For this reason, the performance of a pattern analysis module must be evaluated on novel samples that were not used for training.
Much attention must therefore be paid to the procedure used to estimate the real error rate. A first possible approach is to randomly partition the complete dataset into two subsets: a training set and a test set. The former is used to train the pattern classification module and, thus, to find the best model, while the test set is used to estimate its accuracy. This simple approach works well if the dataset is sufficiently large; otherwise it provides unrealistic performance estimates. Indeed, it is strongly affected by the specific properties of the chosen training and test sets: different splits would probably yield different performance. One way to overcome this is bootstrapping, a statistical technique that generates multiple training-test partitions by resampling the original dataset with replacement.
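As a rough sketch of these two schemes, the following Python fragment draws a random 70/30 hold-out partition and then builds bootstrap training sets by resampling with replacement; the arrays X and y are placeholder data, not the dataset of this work.

# Sketch: hold-out split and bootstrap resampling (placeholder data).
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100
X = rng.normal(size=(n_samples, 8))          # toy feature matrix
y = rng.integers(0, 2, size=n_samples)       # toy binary labels

# Simple hold-out: random 70/30 train-test partition.
perm = rng.permutation(n_samples)
split = int(0.7 * n_samples)
train_idx, test_idx = perm[:split], perm[split:]

# Bootstrap: draw training sets by sampling with replacement; the samples
# never drawn ("out-of-bag") form the corresponding test set.
for _ in range(50):
    boot_idx = rng.integers(0, n_samples, size=n_samples)   # with replacement
    oob_idx = np.setdiff1d(np.arange(n_samples), boot_idx)  # out-of-bag samples
    X_train, y_train = X[boot_idx], y[boot_idx]
    X_test, y_test = X[oob_idx], y[oob_idx]
    # ... train the classifier on (X_train, y_train), evaluate on (X_test, y_test)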
Another possible solution is to adopt a cross-validation approach, which is based on the idea of performing multiple partitions of the dataset and averaging the performance of the model across partitions. K-fold cross-validation performs K data partitions, such that each data subset serves as the test set in one of the K partitions and as part of the training set in the remaining ones. In this way, each sample is used both for training and for testing, ensuring a more reliable estimation of the error rate. Indeed, classifier performance in cross-validation is estimated by averaging the errors obtained at each of the K iterations. A special case of K-fold cross-validation is leave-one-out, where K is equal to the number of data samples and the test set at each iteration consists of a single measure of the dataset.
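A minimal sketch of K-fold and leave-one-out cross-validation, here written with scikit-learn utilities and a placeholder classifier and dataset rather than the ones used in this work:

# Sketch: K-fold and leave-one-out cross-validation (placeholder data).
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(40, 8))
y = rng.integers(0, 2, size=40)

for name, splitter in [("5-fold", KFold(n_splits=5, shuffle=True, random_state=0)),
                       ("leave-one-out", LeaveOneOut())]:
    errors = []
    for train_idx, test_idx in splitter.split(X):
        clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
        errors.append(np.mean(clf.predict(X[test_idx]) != y[test_idx]))
    # The estimated error rate is the average over all iterations.
    print(name, "estimated error rate:", np.mean(errors))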
This is the approach adopted in the presented work, with a small variation: the leave-one-out scheme was transformed into a leave-one-subject-out scheme. Since, as previously mentioned, we acquired two breath samples from each subject, we had to guarantee that the two measurements of a subject never ended up one in the training set and one in the test set. In the leave-one-subject-out cross-validation, each test set was therefore composed of the pair of measurements from the same person, instead of the single measure used in the standard leave-one-out method.
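This grouping can be expressed, for instance, with scikit-learn's LeaveOneGroupOut splitter, where the two measurements of a subject share the same group label; the sketch below uses placeholder data and a generic classifier, not the actual pipeline of this work.

# Sketch: leave-one-subject-out via group labels (placeholder data).
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
n_subjects = 20
X = rng.normal(size=(2 * n_subjects, 8))                  # two measurements per subject
y = np.repeat(rng.integers(0, 2, size=n_subjects), 2)     # same label for both measurements
groups = np.repeat(np.arange(n_subjects), 2)              # subject identifier per sample

logo = LeaveOneGroupOut()
for train_idx, test_idx in logo.split(X, y, groups):
    # test_idx always contains exactly the pair of measurements of one subject
    clf = KNeighborsClassifier(n_neighbors=3).fit(X[train_idx], y[train_idx])
    y_pred = clf.predict(X[test_idx])
    # ... accumulate a confusion matrix from (y[test_idx], y_pred)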
At each iteration of the leave-one-subject-out cross-validation, we evaluated a confusion matrix from which we extracted the corresponding performance indexes. At the end of the process, the mean and the variance of the performance indexes were estimated. Denoting as TruePositive (TP) a sick sample classified as sick, as TrueNegative (TN) a healthy sample classified as healthy, as FalsePositive (FP) a healthy sample classified as sick, and as FalseNegative (FN) a sick sample classified as healthy, the performance indexes are defined as:
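As an illustrative sketch only, the following function computes standard indexes that can be derived from these counts (accuracy, sensitivity, specificity, precision); the exact set of indexes used in this work may differ.

# Sketch: common performance indexes derived from confusion-matrix counts.
def performance_indexes(tp: int, tn: int, fp: int, fn: int) -> dict:
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,    # fraction of correct decisions
        "sensitivity": tp / (tp + fn),    # sick samples recognized as sick
        "specificity": tn / (tn + fp),    # healthy samples recognized as healthy
        "precision": tp / (tp + fp),      # classified-sick samples that are truly sick
    }

# Example with illustrative counts: TP=9, TN=8, FP=2, FN=1
print(performance_indexes(9, 8, 2, 1))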