represent any sort of functional relationship, such as a biological pathway (Hallinan
et al., 2009). Decision trees are a supervised form of learning, whereas clustering is
unsupervised. Supervised algorithms are trained using examples of known category,
whereas unsupervised algorithms do not rely upon the existence of a labelled training
set. There are, again, literally hundreds of clustering algorithms; a recent textbook
runs to four volumes (Byrne and Uprichard, 2012), a degree of coverage which
we shall not attempt to emulate here.
The idea of clustering is intuitively attractive. Biology is inherently modular, and
when a researcher is confronted with a large mass of data, the ability to organise it into
at least semi-coherent groups can simplify analysis tremendously. One of the major
aims of much DNA microarray analysis, for example, is to identify groups of genes
that have similar expression profiles either over time or over a range of conditions.
Just about any sort of data can be clustered, from gene expression measurements,
to taxonomic relationships, to interaction networks. However, therein lies a problem.
Many clustering algorithms, when asked to find n clusters in a dataset, will find n
clusters, whether or not they actually exist. Furthermore, many clustering algorithms
are stochastic, and may produce different cluster memberships when applied repeatedly
to the same data, or when the same data is presented to the algorithm in a different
order. As always with data mining, there is currently no substitute for human
expertise when it comes to the evaluation of clustering results.
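The first caveat above can be illustrated with a minimal sketch: a naive one-dimensional k-means (a deliberately simplified illustration, not an algorithm from the source) asked for three clusters in uniformly random, structureless data will dutifully report three.

```python
import random

def kmeans_1d(points, n_clusters, n_iter=20, seed=0):
    """Naive one-dimensional k-means: partitions the data into exactly
    n_clusters groups, whether or not real structure exists."""
    rng = random.Random(seed)
    centres = rng.sample(points, n_clusters)
    for _ in range(n_iter):
        # Assign each point to its nearest centre.
        clusters = [[] for _ in range(n_clusters)]
        for p in points:
            nearest = min(range(n_clusters), key=lambda i: abs(p - centres[i]))
            clusters[nearest].append(p)
        # Recompute each centre as the mean of its cluster (keep the old
        # centre if a cluster happens to be empty).
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return clusters

# Structureless data: 100 points drawn uniformly from [0, 1).
rng = random.Random(1)
data = [rng.random() for _ in range(100)]
clusters = kmeans_1d(data, 3)
print(len(clusters))  # three clusters are reported regardless
```

Nothing in the output warns the user that the "clusters" are an artefact of the request rather than a property of the data, which is precisely why human evaluation of clustering results remains essential.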
All clustering algorithms are based upon the concept of distance between records.
Members of a cluster are presumed to be closer to each other, in some easily
computable sense, than to members of other clusters. The most widely
used distance metric in microbiology is undoubtedly generated by an all-against-
all BLAST, which generates a matrix of similarity values between each pair of pro-
teins in the input dataset.
Another very simple distance metric, applicable to items which are described in
terms of strings of integers, is the Hamming distance (Hamming, 1950). Consider
two proteins, represented in terms of the domains, D1-D9, which they contain. In this
representation 0 indicates that the domain is not present, while 1 indicates that it is.
            D1  D2  D3  D4  D5  D6  D7  D8  D9
Protein 1    0   0   1   1   0   1   0   0   1
Protein 2    0   1   1   0   0   1   0   1   1
The proteins differ at three positions: D2, D4 and D8. The Hamming distance
between them is therefore simply 3.
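The calculation is trivial to express in code. A minimal sketch in Python, using the two domain-presence vectors from the table above (the function name is illustrative):

```python
# Domain-presence vectors for the two proteins (D1-D9):
# 0 = domain absent, 1 = domain present.
protein_1 = [0, 0, 1, 1, 0, 1, 0, 0, 1]
protein_2 = [0, 1, 1, 0, 0, 1, 0, 1, 1]

def hamming(a, b):
    """Number of positions at which two equal-length vectors differ."""
    if len(a) != len(b):
        raise ValueError("vectors must be the same length")
    return sum(x != y for x, y in zip(a, b))

print(hamming(protein_1, protein_2))  # 3: the proteins differ at D2, D4, D8
```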
If the variables describing the item are real numbers, the Euclidean distance is the
simplest distance metric. To calculate the Euclidean distance one subtracts the value
of the variables at the same position in each vector, squares the result, sums all of the
values, and takes the square root of the sum:
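The steps just described translate directly into a few lines of Python (a minimal sketch; the example vectors are illustrative, not from the source):

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences between two
    equal-length vectors of real numbers."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Differences are 3, 4 and 0; squares sum to 25; the root is 5.
print(euclidean([1.0, 2.0, 3.0], [4.0, 6.0, 3.0]))  # 5.0
```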