Biomedical Engineering Reference
Since the basis for the learning approaches inherent in these different algorithms is beyond
the scope of the present review, we have left out the details of these methods.
Once machine learning has been carried out, there remains the problem of validation.
Validation involves the task of determining the expected error rate from applying the
trained classifier to a new data set. For large data sets, where collecting a fresh,
independent data set is not possible, the common machine learning validation technique
called cross-validation is often used. For example, in tenfold cross-validation, the entire data set is divided into ten
groups. The classifier model is created using nine of the groups and the classifier is then
tested on the 10th group. This process is repeated iteratively, until each of the ten groups
has served as the test group. The ten estimates are then averaged, providing an overall
accuracy distribution and error estimate of the predictive power of the classifier. Another
variant of this technique is the leave-one-out validation method. In this technique, one
data point is left out of each of the iterations used to create the classifier model. This data
point is then used to test the classifier. The process is repeated until every data point has
been left out once for testing the classifier and the accuracy estimates are then averaged.
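The two validation schemes described above can be sketched in a few lines of Python. This is a minimal illustration only, not the procedure used in any of the cited studies: the nearest-centroid classifier, the synthetic two-class data, and all function names (nearest_centroid_fit, nearest_centroid_predict, cross_validate) are assumptions introduced for the example. Setting k equal to the number of data points turns the same routine into leave-one-out validation.

```python
import random
from statistics import mean

def nearest_centroid_fit(X, y):
    """Train a toy classifier (an assumption for this sketch): one mean vector per class."""
    centroids = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [mean(col) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    """Predict the class whose centroid is closest in squared Euclidean distance."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def cross_validate(X, y, k, seed=0):
    """Return the k per-fold accuracies; k == len(X) gives leave-one-out."""
    if not 2 <= k <= len(X):
        raise ValueError("k must be between 2 and the number of data points")
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k disjoint groups
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        # Build the model on the remaining k - 1 groups only.
        model = nearest_centroid_fit([X[i] for i in train_idx],
                                     [y[i] for i in train_idx])
        # Test on the group that was held out.
        hits = sum(nearest_centroid_predict(model, X[i]) == y[i] for i in test_idx)
        accuracies.append(hits / len(test_idx))
    return accuracies

# Hypothetical data: two well-separated classes, 20 points each, 2 dimensions.
rng = random.Random(1)
X = ([[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(20)]
     + [[rng.gauss(10, 1), rng.gauss(10, 1)] for _ in range(20)])
y = [0] * 20 + [1] * 20

tenfold = cross_validate(X, y, k=10)      # tenfold cross-validation
loo = cross_validate(X, y, k=len(X))      # leave-one-out validation
print("tenfold mean accuracy:", mean(tenfold))
print("leave-one-out mean accuracy:", mean(loo))
```

The list of per-fold accuracies is returned rather than only its mean, so that the spread across folds can be inspected alongside the averaged estimate.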
In machine learning approaches, classifiers do best when the number of dimensions or
variables within the data set is small (less than 100) and the number of data points is large
(greater than 1000), that is, at least ten data points per dimension (a dimension:data point ratio of 1:10). However, in many data sets this is not the
case. In fact, the very opposite may be true and we illustrate this below with the case of
nucleic acid microarray biosensor measurements.
1.4.2 Application of Machine Learning to the Analysis of High-Dimensionality Data from Microarray Biosensors
As we discussed previously, nucleic acid microarrays are a type of optical-based biosensor
employing hybridization of analyte nucleic acids from an unknown sample to complementary
surface-immobilized nucleic acid sequence probes from known genes (160).
Within a single microarray area, the number of immobilized probes, or dimensions of data
obtained from them, may be as high as 30,000. In the case of human gene probes forming
the microarray, this would represent nearly the entire human gene complement. In contrast,
the number of data points determined for each data dimension, representing, for example, tissue
samples from different patients, may be as few as 30. For this example the
data dimension:data point ratio is 1000:1, very much the reverse of the desired ratio (1:10)
in data mining situations that we discussed above. Even though each of the 30,000 probes
is immobilized on a small surface area, around (70 μm)² per probe, these nucleic acid
'biochips', totalling a few cm², are costly, and collecting many data points in this type of
biosensor experiment may entail unreasonable expense. In Figure 1.51 we show a representative
example of the hybridization intensities from a microarray spotted surface (174).
Each spot signal intensity is measured using an automated reading system. Initially, the
quantitative fluorescence output of the biosensor's 30,000 individual hybridization signals
is often simply plotted to ascertain the level of
each of the complementary mRNA sequences in the biological sample. However, to use
machine learning effectively on these data sets, one must often first reduce the
dimensionality of the data set. A number of statistical techniques, such as pairwise t and
F statistics, can be used to reduce the dimensionality. The aim is to select the minimum
number of dimensions that best discriminate among classes and to eliminate all those
dimensions that provide no class discrimination or in fact prevent it (175).
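As a rough illustration of dimensionality reduction by a pairwise t statistic, the sketch below ranks probes by the absolute value of a two-sample t statistic between two classes and keeps the highest-ranked ones. Several things here are assumptions made for the example and are not taken from the text: the Welch (unequal-variance) form of the statistic, the synthetic data (30 samples by 1,000 probes, scaled down from 30,000 so it runs quickly), the three artificially informative probes, and the names welch_t and rank_probes.

```python
import math
import random
from statistics import mean, variance

def welch_t(a, b):
    """Two-sample t statistic with unequal variances (Welch form, assumed here)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0

def rank_probes(samples, labels):
    """Return probe indices sorted by decreasing |t| between the two classes."""
    class_a, class_b = sorted(set(labels))
    scores = []
    for j in range(len(samples[0])):
        a = [row[j] for row, lab in zip(samples, labels) if lab == class_a]
        b = [row[j] for row, lab in zip(samples, labels) if lab == class_b]
        scores.append(abs(welch_t(a, b)))
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)

# Hypothetical data: 30 samples (rows) x 1,000 probes (columns).
rng = random.Random(0)
n_samples, n_probes = 30, 1000
informative = {3, 42, 777}          # probes given a real class difference
labels = [0] * 15 + [1] * 15
samples = [
    [rng.gauss(6.0 if (j in informative and lab == 1) else 0.0, 1.0)
     for j in range(n_probes)]
    for lab in labels
]

top = rank_probes(samples, labels)[:10]   # keep the 10 best-discriminating probes
print("probes per sample:", n_probes / n_samples)
print("top-ranked probes:", top)
```

Ranking by a univariate statistic treats each probe independently; it does not account for probes that discriminate only in combination, which is one reason other selection methods are also used.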
The Center for Intelligent Biomaterials has been involved in informatics and data mining
analyses of large data sets from nucleic acid array biosensors as well as whole cell assay
biosensors. This was prior to and subsequent to the formation of AnVil, Inc., a venture-backed