represent any sort of functional relationship, such as a biological pathway (Hallinan
et al., 2009). Decision trees are a supervised form of learning, whereas clustering is
unsupervised. Supervised algorithms are trained using examples of known category,
whereas unsupervised algorithms do not rely upon the existence of a labelled training
set. There are, again, literally hundreds of clustering algorithms; a recent textbook
runs to four volumes (Byrne and Uprichard, 2012), a degree of coverage which
we shall not attempt to emulate here.
The idea of clustering is intuitively attractive. Biology is inherently modular, and
when a researcher is confronted with a large mass of data, the ability to organise it into
at least semi-coherent groups can simplify analysis tremendously. One of the major
aims of much DNA microarray analysis, for example, is to identify groups of genes
that have similar expression profiles either over time or over a range of conditions.
Just about any sort of data can be clustered, from gene expression measurements,
to taxonomic relationships, to interaction networks. However, therein lies a problem.
Many clustering algorithms, when asked to find n clusters in a dataset, will find n
clusters, whether or not they actually exist. Furthermore, many clustering algorithms
are stochastic, and may produce different cluster memberships when applied repeatedly
to the same data, or when the same data is presented to the algorithm in a different
order. As always with data mining, there is currently no substitute for human
expertise when it comes to the evaluation of clustering results.
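The first caveat above can be illustrated with a minimal sketch: a naive one-dimensional k-means (a deliberately simplified illustration, not an algorithm from the source) asked for three clusters in uniformly random, structureless data will dutifully report three.

```python
import random

def kmeans_1d(points, n_clusters, n_iter=20, seed=0):
    """Naive one-dimensional k-means: partitions the data into exactly
    n_clusters groups, whether or not real structure exists."""
    rng = random.Random(seed)
    centres = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its cluster (keep the old
        # centre if a cluster happens to be empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Structureless data: 100 points drawn uniformly from [0, 1).
rng = random.Random(1)
data = [rng.random() for _ in range(100)]
clusters = kmeans_1d(data, 3)
print(len(clusters))  # three clusters are reported regardless
```

Nothing in the output warns the user that the "clusters" are an artefact of the request rather than a property of the data, which is precisely why human evaluation of clustering results remains essential.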
All clustering algorithms are based upon the concept of distance between records.
Members of a cluster are presumed to be closer to each other, in some easily
computable sense, than to members of other clusters. The most widely
used distance metric in microbiology is undoubtedly generated by an all-against-
all BLAST, which generates a matrix of similarity values between each pair of pro-
teins in the input dataset.
Another very simple distance metric, applicable to items which are described in
terms of strings of integers, is the Hamming distance (Hamming, 1950). Consider
two proteins, represented in terms of the domains, D1-D9, which they contain. In this
representation 0 indicates that the domain is not present, while 1 indicates that it is.
            D1  D2  D3  D4  D5  D6  D7  D8  D9
Protein 1    0   0   1   1   0   1   0   0   1
Protein 2    0   1   1   0   0   1   0   1   1
The proteins differ at three positions: D2, D4 and D8. The Hamming distance
between them is therefore simply 3.
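The calculation is trivial to express in code. A minimal sketch in Python, using the two domain-presence vectors from the table above (the function name is illustrative):

```python
# Domain-presence vectors for the two proteins (D1-D9):
# 0 = domain absent, 1 = domain present.
protein_1 = [0, 0, 1, 1, 0, 1, 0, 0, 1]
protein_2 = [0, 1, 1, 0, 0, 1, 0, 1, 1]

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must be the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming(protein_1, protein_2))  # 3: the proteins differ at D2, D4, D8
```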
If the variables describing the item are real numbers, the Euclidean distance is the
simplest distance metric. To calculate the Euclidean distance one subtracts the value
of the variables at the same position in each vector, squares the result, sums all of the
values, and takes the square root of the sum:
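The steps just described translate directly into a few lines of Python (a minimal sketch; the example vectors are illustrative, not from the source):

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences between two
    equal-length vectors of real numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Differences are 3, 4 and 0; squares sum to 25; the root is 5.
print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
```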