$$D(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$
where A and B are two vectors of n variables each, indexed by i. Squaring each
difference before summing means that only the magnitude of each difference
contributes, regardless of its sign; taking the square root of the final sum restores
the original scale of the variables.
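In code this is a short computation; the following minimal Python sketch (the function name and the length check are our own, illustrative additions) implements the formula directly:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    # Square each per-variable difference, sum, then take the square root.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```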
Consider two proteins for which a number of features have been measured:
            Feature 1   Feature 2   Feature 3   Feature 4
Protein 1        20.8       153.2        23.5        54.8
Protein 2        25.4       120.0        22.8        65.2
The distance, D, between these proteins, using this representation, is:
$$D = \sqrt{(20.8 - 25.4)^2 + (153.2 - 120.0)^2 + (23.5 - 22.8)^2 + (54.8 - 65.2)^2} \approx 35.1$$
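The arithmetic can be checked with the distance sketch given above:

```python
protein_1 = [20.8, 153.2, 23.5, 54.8]
protein_2 = [25.4, 120.0, 22.8, 65.2]

print(euclidean_distance(protein_1, protein_2))  # 35.1005..., i.e. roughly 35.1
```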
There are, of course, many other possible distance metrics that can be applied to
microbiological data. New metrics are being developed constantly, as new types
of data are generated and old types re-evaluated.
There are a number of clustering algorithms that are widely used because they are
easy to understand and interpret. One of the most popular is the k-nearest neighbour
(k-nn) algorithm (Fukunaga and Narendra, 1975). As with most clustering algorithms,
k-nn starts with a matrix of distances between every pair of items in the dataset.
The dataset must contain some items whose classification is known. The single
parameter k is chosen to be an odd number, and each item in the dataset is simply
assigned to the category held by the majority of its k nearest neighbours; that is,
the k data items which are closest to it using the chosen metric.
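A minimal sketch of this voting scheme, assuming the labelled items are stored as (vector, label) pairs and reusing the distance function sketched earlier (the names and data layout are illustrative, not taken from Fukunaga and Narendra):

```python
from collections import Counter

def knn_classify(item, labelled_items, k, distance):
    """Assign item the label held by the majority of its k nearest neighbours.

    labelled_items is a list of (vector, label) pairs; distance is any metric,
    for example the euclidean_distance sketched above. k is chosen odd so
    that a two-class vote cannot tie.
    """
    # Rank the labelled items by their distance from the query item.
    neighbours = sorted(labelled_items, key=lambda pair: distance(item, pair[0]))
    # Tally the labels of the k closest items and return the majority label.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```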
A slightly more sophisticated algorithm is k-means (Hartigan and Wong, 1979).
For this algorithm, the user chooses the number of clusters, k, believed to exist in the
data. The mean, or centroid (centre of mass), of each cluster is assigned, either on the
basis of prior knowledge, or by selecting well-spaced items randomly from the data
to be clustered. The clustering process is iterative: the input data are read one at a
time, and each item is assigned to the cluster to whose centroid it is closest. The cen-
troids of the clusters are then re-calculated. This process continues until the centroids
no longer change. This algorithm is sensitive both to the number of clusters chosen
and to the initial centroid assignment, so it is generally wise to repeat it multiple times,
varying both of these factors and examining the resulting clusters for plausibility.
Fortunately, this simple algorithm is very fast, even with large datasets, so the pro-
cess of automating a large number of runs is straightforward and efficient.
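A sketch of a single run, assuming initial centroids are drawn at random from the data and reusing the Euclidean distance above (repeating the run with different k values and seeds, as recommended, is left to the caller):

```python
import random

def k_means(data, k, max_iterations=100, seed=None):
    """Partition data (a list of feature vectors) into k clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # seed the centroids with random items
    for _ in range(max_iterations):
        # Assignment step: each item joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for item in data:
            nearest = min(range(k), key=lambda j: euclidean_distance(item, centroids[j]))
            clusters[nearest].append(item)
        # Update step: each centroid becomes the mean of its cluster's members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return clusters, centroids
```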
Once cluster membership is determined, it can be used for a number of purposes.
One of the most common is the prediction of the characteristics of unknown proteins.