Data mining for microbiologists - Methods in Microbiology

and diseased (Schippa et al., 2010; Sobhani et al., 2011), the soil (Delmont et al., 2011); assessing the antibacterial activity of microbes (Wang et al., 2010); and many investigations into microbial community fingerprints (Illian et al., 2009; Xu et al., 2010; Zhang et al., 2010a).

5 MACHINE LEARNING TECHNIQUES

Statistical approaches have the advantages of being well established, resting on a solid mathematical foundation, and being generally well accepted by the scientific community. Wherever possible, statistical approaches should be tried on large, complex datasets. However, the price of this strong foundation is a restriction on the types of data that can be handled and, often, a reliance on assumptions about the distributions of the variables to be explored. When data are very large or messy, or there is no plausible hypothesis to be tested, heuristically based machine learning techniques can provide unique insights into the underlying biological processes.

5.1 Support vector machines

A support vector machine (SVM) is another type of classification algorithm (Boser et al., 1992). SVMs attempt to identify a separating hyperplane between classes, in a manner similar to discriminant analysis. They differ from discriminant analysis, however, in the way in which the hyperplane is selected. An SVM attempts to select the hyperplane that lies in the middle of the gap between the categories, and is therefore maximally far away from both classes of data. However, real data are rarely cleanly separable, so the SVM algorithm allows for some misclassifications. The proportion of misclassifications tolerated is controlled by a user-adjustable parameter, known as the soft margin.
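The effect of the soft margin can be illustrated with a minimal sketch, which assumes the common formulation in which a linear SVM minimises 0.5*||w||^2 plus a penalty c times the summed hinge loss. The function names, the penalty parameter c, the learning rate, the number of epochs and the toy data are all illustrative assumptions rather than anything taken from the works cited above, and plain subgradient descent is used only because it is short, not because it is how SVM packages are usually implemented.

```python
import math


def train_linear_svm(points, labels, c=1.0, epochs=5000, lr=0.001):
    """Fit (w, b) by subgradient descent on 0.5*||w||^2 + c * sum(hinge loss).

    points: list of feature tuples; labels: list of +1 / -1.
    The parameter c is a hypothetical name for the soft-margin penalty:
    small c tolerates more margin violations, large c tolerates fewer.
    """
    n_features = len(points[0])
    w = [0.0] * n_features
    b = 0.0
    for _ in range(epochs):
        grad_w = list(w)  # gradient of the regulariser 0.5*||w||^2 is w
        grad_b = 0.0
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1.0:
                # Point lies inside the margin or is misclassified,
                # so its hinge loss contributes to the subgradient.
                for j in range(n_features):
                    grad_w[j] -= c * y * x[j]
                grad_b -= c * y
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def predict(w, b, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1


def norm(w):
    return math.sqrt(sum(wi * wi for wi in w))


# Toy, made-up data: two well-separated clusters in two dimensions.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (5, 5), (5, 6), (6, 5), (6, 6)]
labels = [-1, -1, -1, -1, 1, 1, 1, 1]

w_narrow, b_narrow = train_linear_svm(points, labels, c=1.0)   # strict penalty
w_wide, b_wide = train_linear_svm(points, labels, c=0.01)      # lenient penalty

print("strict penalty: ", w_narrow, b_narrow)
print("lenient penalty:", w_wide, b_wide)
print("prediction for (5.5, 5.5):", predict(w_narrow, b_narrow, (5.5, 5.5)))
```

Because the width of the margin is 2/||w||, the lenient setting yields a smaller ||w|| and therefore a wider margin that more training points are allowed to fall inside.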

SVMs are characterised by the use of a kernel function that adds an extra dimension to the data, essentially projecting them from a low-dimensional space into a higher-dimensional space. Data are more widely scattered in higher-dimensional spaces, and are therefore often more easily separable. It has been proven that for any data set there exists a kernel function which will allow the data to be linearly separated (Noble, 2006), but identifying this function is something of a black art, and kernels are usually chosen by trial and error. SVMs can be extended to handle more than two classes of data in a relatively straightforward manner (Lee et al., 2004b; Noble, 2004). SVMs also have the advantage of not assuming that the training data are normally distributed.
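The idea of projecting data into a higher-dimensional space can be sketched with a small example. It assumes a degree-2 polynomial kernel and a made-up data set in which one class sits near the origin and the other surrounds it; the kernel, the feature map, the separating plane and all names below are illustrative choices, not a method prescribed by the works cited above. No straight line separates the two classes in the original two dimensions, but a flat plane does so after mapping to three dimensions, and the kernel returns the same inner product without ever constructing the mapped points.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def feature_map(x):
    """Explicit map from 2-D to 3-D: (x1, x2) -> (x1^2, sqrt(2)*x1*x2, x2^2)."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def poly2_kernel(x, z):
    """Degree-2 polynomial kernel, computed entirely in the original 2-D space."""
    return dot(x, z) ** 2


# Toy, made-up data: class -1 lies near the origin, class +1 surrounds it.
inner = [(0.5, 0.0), (0.0, -0.5), (-0.4, 0.3), (0.3, 0.4)]
outer = [(2.0, 0.0), (0.0, 2.0), (-2.0, 0.0), (0.0, -2.0)]

# In the mapped space the plane z1 + z3 = 1 separates the two classes.
# This plane was chosen by hand for illustration; an SVM would learn one.
w, b = (1.0, 0.0, 1.0), -1.0


def classify(x):
    """Linear decision rule applied to the mapped point."""
    return 1 if dot(w, feature_map(x)) + b >= 0 else -1


# The kernel equals the inner product of the mapped points.
x, z = (1.0, 2.0), (3.0, -1.0)
print(poly2_kernel(x, z), dot(feature_map(x), feature_map(z)))
print([classify(p) for p in inner], [classify(p) for p in outer])
```

In practice the explicit map is never built; the SVM only ever evaluates the kernel on pairs of data points, which is what makes very high-dimensional projections affordable.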

The advantages of SVMs for classifier construction mean that they have been widely used in a number of fields, including microbiology. They have been used for tasks such as predicting the subcellular localisation of proteins (Gardy et al., 2005; Rashid et al., 2007), gene finding (Krause et al., 2007), analysis of proteins (Rausch et al., 2005) and protein function classification (Cai et al., 2003).
