$$D(A, B) = \sqrt{\sum_{i=1}^{n} (A_i - B_i)^2}$$
where A and B are two vectors of n variables each, indexed by i. Squaring each
difference before summing means that only the magnitude of each difference
contributes, regardless of its sign; taking the square root of the final sum restores
the original scale of the variables.
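In code this is a short computation; the following minimal Python sketch (the function name and the length check are our own, illustrative additions) implements the formula directly:

```python
import math

def euclidean_distance(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of features")
    # Square each per-variable difference, sum, then take the square root.
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))
```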
Consider two proteins for which a number of features have been measured:
            Feature 1   Feature 2   Feature 3   Feature 4
Protein 1        20.8       153.2        23.5        54.8
Protein 2        25.4       120.0        22.8        65.2
The distance, D, between these proteins, using this representation, is:
$$D = \sqrt{(20.8 - 25.4)^2 + (153.2 - 120.0)^2 + (23.5 - 22.8)^2 + (54.8 - 65.2)^2} \approx 35.1$$
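The arithmetic can be checked with the distance sketch given above:

```python
protein_1 = [20.8, 153.2, 23.5, 54.8]
protein_2 = [25.4, 120.0, 22.8, 65.2]

print(euclidean_distance(protein_1, protein_2))  # 35.1005..., i.e. roughly 35.1
```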
There are, of course, many other possible distance metrics that can be applied to
microbiological data. New metrics are being developed constantly, as new types
of data are generated and old types re-evaluated.
There are a number of clustering algorithms that are widely used because they are
easy to understand and interpret. One of the most popular is the k-nearest neighbour
(k-nn) algorithm (Fukunaga and Narendra, 1975). As with most clustering algorithms,
k-nn starts with a matrix of distances between every pair of items in the dataset.
The dataset must contain some items whose classification is known. The single
parameter k is chosen to be an odd number, and each item in the dataset is simply
assigned to the category held by the majority of its k nearest neighbours; that is,
the k data items which are closest to it using the chosen metric.
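A minimal sketch of this voting scheme, assuming the labelled items are stored as (vector, label) pairs and reusing the distance function sketched earlier (the names and data layout are illustrative, not taken from Fukunaga and Narendra):

```python
from collections import Counter

def knn_classify(item, labelled_items, k, distance):
    """Assign item the label held by the majority of its k nearest neighbours.

    labelled_items is a list of (vector, label) pairs; distance is any metric,
    for example the euclidean_distance sketched above. k is chosen odd so
    that a two-class vote cannot tie.
    """
    # Rank the labelled items by their distance from the query item.
    neighbours = sorted(labelled_items, key=lambda pair: distance(item, pair[0]))
    # Tally the labels of the k closest items and return the majority label.
    votes = Counter(label for _, label in neighbours[:k])
    return votes.most_common(1)[0][0]
```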
A slightly more sophisticated algorithm is k-means (Hartigan and Wong, 1979).
For this algorithm, the user chooses the number of clusters, k, believed to exist in the
data. The mean, or centroid (centre of mass), of each cluster is assigned, either on the
basis of prior knowledge, or by selecting well-spaced items randomly from the data
to be clustered. The clustering process is iterative: the input data are read one at a
time, and each item is assigned to the cluster to whose centroid it is closest. The cen-
troids of the clusters are then re-calculated. This process continues until the centroids
no longer change. This algorithm is sensitive both to the number of clusters chosen
and to the initial centroid assignment, so it is generally wise to repeat it multiple times,
varying both of these factors and examining the resulting clusters for plausibility.
Fortunately, this simple algorithm is very fast, even with large datasets, so the pro-
cess of automating a large number of runs is straightforward and efficient.
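A sketch of a single run, assuming initial centroids are drawn at random from the data and reusing the Euclidean distance above (repeating the run with different k values and seeds, as recommended, is left to the caller):

```python
import random

def k_means(data, k, max_iterations=100, seed=None):
    """Partition data (a list of feature vectors) into k clusters."""
    rng = random.Random(seed)
    centroids = rng.sample(data, k)  # seed the centroids with random items
    for _ in range(max_iterations):
        # Assignment step: each item joins the cluster with the nearest centroid.
        clusters = [[] for _ in range(k)]
        for item in data:
            nearest = min(range(k), key=lambda j: euclidean_distance(item, centroids[j]))
            clusters[nearest].append(item)
        # Update step: each centroid becomes the mean of its cluster's members;
        # an empty cluster keeps its previous centroid.
        new_centroids = [
            [sum(dim) / len(cluster) for dim in zip(*cluster)] if cluster else centroids[j]
            for j, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged: no centroid moved
            break
        centroids = new_centroids
    return clusters, centroids
```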
Once cluster membership is determined, it can be used for a number of purposes.
One of the most common is the prediction of the characteristics of unknown proteins.