and diseased (Schippa et al., 2010; Sobhani et al., 2011), the soil (Delmont et al.,
2011); assessing the antibacterial activity of microbes (Wang et al., 2010); and many
investigations into microbial community fingerprints (Illian et al., 2009; Xu et al.,
2010; Zhang et al., 2010a).
5 MACHINE LEARNING TECHNIQUES
Statistical approaches have the advantages of being well established, resting on a
solid mathematical foundation, and being generally well accepted by the scientific
community. Wherever possible, statistical approaches should be tried on large, complex
datasets. However, the price to pay for the strong foundation of statistical techniques
is a set of limitations on the type of data that can be handled and, often, on the
assumptions that must be made about the distributions of the variables to be explored.
When data are very large or messy, or there is no plausible hypothesis to be tested,
heuristically based machine learning techniques can provide unique insights into the
underlying biological processes.
5.1 Support vector machines
A support vector machine (SVM) is yet another type of classification algorithm
(Boser et al., 1992). SVMs attempt to identify a separating hyperplane between
classes, in a manner similar to discriminant analysis. They differ from discriminant
analysis, however, in the way in which the hyperplane is selected. An SVM attempts to
select the hyperplane that lies in the middle of the gap between the categories, and
is therefore maximally far away from both classes of data. However, real data are
rarely cleanly separable, so the SVM algorithm allows for some misclassifications.
The extent of misclassification tolerated is controlled by a user-adjustable
parameter, known as the soft margin.
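The idea of a maximally distant hyperplane that still tolerates a few misclassified points can be sketched in a few lines of Python. This is only a minimal illustration, not the algorithm of Boser et al.: it assumes a linear SVM trained by plain subgradient descent on the regularised hinge loss, and the names train_linear_svm, predict, the parameter c (playing the role of the soft margin setting) and the toy data are all hypothetical.

```python
# Minimal soft-margin linear SVM, trained by full-batch subgradient descent.
# Assumption: the objective is 0.5*||w||^2 + c * sum(hinge losses), where a
# larger c penalises margin violations more heavily.

def train_linear_svm(points, labels, c=1.0, epochs=200, lr=0.01):
    """Return the weight vector w and bias b of the fitted hyperplane."""
    dim = len(points[0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regulariser 0.5*||w||^2 is w
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Inside the margin or misclassified: the hinge loss is active.
                for i in range(dim):
                    grad_w[i] -= c * y * x[i]
                grad_b -= c * y
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Classify x by the side of the hyperplane on which it falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


# Six cleanly separated points, followed by two deliberately mislabelled ones.
points = [(2.0, 1.0), (1.0, 2.0), (2.0, 2.0),
          (-2.0, -1.0), (-1.0, -2.0), (-2.0, -2.0),
          (-0.5, -0.5), (0.5, 0.5)]
labels = [1, 1, 1, -1, -1, -1, 1, -1]

w, b = train_linear_svm(points, labels, c=1.0)
predictions = [predict(w, b, x) for x in points]
errors = sum(p != y for p, y in zip(predictions, labels))
```

In this toy case the fitted hyperplane classifies the six clean points correctly and accepts the two mislabelled points as errors, rather than failing to find any boundary at all.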
SVMs are characterised by the use of a kernel function that adds extra dimensions
to the data, essentially projecting them from a low-dimensional space into a
higher-dimensional space. Data are more widely scattered in higher-dimensional spaces,
and are therefore often more easily separable. It has been proven that for any
consistently labelled data set there exists a kernel function which will allow the
data to be linearly separated (Noble, 2006), but the task of identifying this function
is a black art, and kernels are usually chosen by trial and error. SVMs can be
extended to handle more than two classes of data in a relatively straightforward
manner (Lee et al., 2004b; Noble, 2004). SVMs also have the advantage of not assuming
that the training data are normally distributed.
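The effect of projecting data into a higher-dimensional space can be illustrated with a small, self-contained Python sketch. It assumes a homogeneous degree-2 polynomial kernel, chosen here purely for illustration; the feature map phi, the helper names, the toy points and the hand-picked plane (w, b) are all hypothetical and are not taken from the cited works.

```python
import math

def phi(x):
    """Explicit degree-2 feature map taking a 2-D point to a 3-D point."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Homogeneous degree-2 polynomial kernel: k(x, z) = (x . z)^2."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


# An inner ring (class +1) surrounded by an outer ring (class -1).
# No straight line in the original 2-D space separates the two classes.
inner = [(0.5, 0.0), (0.0, 0.5), (-0.5, 0.0), (0.0, -0.5)]
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]

# In the 3-D feature space a plane does separate them. This one is picked by
# hand and corresponds to the circle x1^2 + x2^2 = 1 in the original space.
w, b = (-1.0, 0.0, -1.0), 1.0
inner_scores = [dot(w, phi(p)) + b for p in inner]  # all positive
outer_scores = [dot(w, phi(p)) + b for p in outer]  # all negative

# The kernel returns the feature-space inner product without computing phi.
p, q = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(p, q)
explicit_value = dot(phi(p), phi(q))
```

The last two values agree up to floating-point rounding, which is why an SVM only ever needs to evaluate the kernel on pairs of data points and never has to construct the higher-dimensional coordinates explicitly.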
The advantages of SVMs for classifier construction mean that they have been
widely used in a number of fields, including microbiology. They have been used
for tasks such as predicting the subcellular localisation of proteins (Gardy et al.,
2005; Rashid et al., 2007), gene finding (Krause et al., 2007), analysis of proteins
(Rausch et al., 2005) and protein function classification (Cai et al., 2003).