Biomedical Engineering Reference
In-Depth Information
experimental treatment and one for a control array, you can visualize this as
a straight line between the two points in two-dimensional space.
Another commonly used distance metric is the Pearson correlation coefficient, which measures how correlated the profiles are. The Euclidean distance is very good at clustering together genes or samples that have a similar profile in amplitude, whereas the Pearson correlation is better at clustering together profiles with the same shape regardless of their amplitude.
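To make the distinction concrete, here is a minimal Python sketch (an illustration, not from the original text) computing both metrics on two hypothetical profiles that share a shape but differ in amplitude:

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two expression profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: near 0 for profiles with the same shape
    regardless of amplitude, up to 2 for anti-correlated profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Two hypothetical profiles: same shape, but b has twice the amplitude.
a = [1.0, 2.0, 3.0, 4.0]
b = [2.0, 4.0, 6.0, 8.0]

print(euclidean_distance(a, b))  # ~5.48: Euclidean sees the amplitude gap
print(pearson_distance(a, b))    # ~0.0: Pearson sees identical shapes
```

The Euclidean metric separates the pair because their magnitudes differ, while the Pearson-based distance treats them as essentially identical.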
As is the case with statistical analysis, clustering techniques break down in the presence of noise (especially with the Pearson distance metric). It is therefore highly recommended to remove genes/assays with a noisy profile, such as genes expressed at low levels in the background range of the array. The cut-off can be arbitrary, or inclusion can be based on a previous statistical analysis. As a caution, be aware that performing a cluster analysis on genes that were found significant for separating two classes will merely illustrate results you have already discovered.
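As an illustration of such a filter, the sketch below drops genes whose intensities stay in the background range; the expression values and the cut-off are hypothetical, and the "above background in every array" rule is just one possible inclusion criterion:

```python
# Hypothetical expression matrix: gene name -> intensity on each array.
expression = {
    "geneA": [850.0, 920.0, 780.0],   # well above background everywhere
    "geneB": [12.0, 15.0, 9.0],       # always in the background range
    "geneC": [300.0, 40.0, 510.0],    # dips into background on one array
}

BACKGROUND_CUTOFF = 50.0  # arbitrary, array-dependent threshold

# Keep a gene only if it exceeds background on every array;
# a more permissive rule might require only a fraction of arrays.
filtered = {gene: values for gene, values in expression.items()
            if all(v > BACKGROUND_CUTOFF for v in values)}

print(sorted(filtered))  # → ['geneA']
```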
Once you decide on an appropriate distance metric, you will have to select the method to perform the classification. The most commonly used methods are:
• hierarchical clustering (Eisen et al., 1998; Spellman et al., 1998)
• k-means clustering (Theilhaber et al., 2002)
• self-organizing tree (Dopazo and Carazo, 1997; Herrero et al., 2001)
• self-organizing maps (Tamayo et al., 1999)
• principal component analysis (Raychaudhuri et al., 2000).
Hierarchical clustering is similar to a phylogenetic algorithm in that it computes the distance between every pair of genes or samples and joins the closest pair. It then recomputes all the distances between the clusters, including the newly formed pair, and keeps joining the closest pairs until only one big group is left. Because it grows a tree iteration by iteration, this algorithm is classified as 'agglomerative' (Plate II, see color section between pages 64 and 65).
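The agglomerative loop described above can be sketched in a few lines of Python. The gene profiles, the Euclidean metric, and the single-linkage joining rule are illustrative assumptions, not the implementation used in the cited papers:

```python
import math

def agglomerate(profiles):
    """Single-linkage agglomerative clustering of expression profiles.
    Repeatedly joins the closest pair of clusters until one group remains;
    returns the merge order as a nested tuple of profile indices."""
    def dist(a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

    # Each cluster: (label, list of member profiles)
    clusters = [(i, [p]) for i, p in enumerate(profiles)]
    while len(clusters) > 1:
        # Find the closest pair (minimum distance between any two members).
        i, j = min(
            ((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
            key=lambda ij: min(dist(a, b)
                               for a in clusters[ij[0]][1]
                               for b in clusters[ij[1]][1]))
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0][0]

# Four hypothetical genes measured on two arrays: two tight groups.
genes = [(0.0, 0.0), (0.1, 0.1), (5.0, 5.0), (5.2, 5.1)]
print(agglomerate(genes))  # → ((0, 1), (2, 3))
```

The nested tuple records the merge order, which is exactly the tree a dendrogram would draw.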
K-means clustering, self-organizing maps and self-organizing trees are divisive algorithms: they start with the whole data set and split it into clusters. For example, with k-means, the user specifies how many clusters they think the dataset contains, and the software randomly assigns genes to a cluster. It then iteratively computes the average of each cluster and reassigns every gene to the cluster it is most similar to. After a few hundred iterations, the cluster averages stabilize and all the genes are assigned to their closest cluster. Since these algorithms start by selecting and clustering genes at random, they do not always yield the same results. Some methods repeat the procedure a dozen to a hundred times and report the consensus clusters (the genes that cluster together most of the time).
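A minimal sketch of this repeat-and-vote idea follows; the one-dimensional expression values, the choice of k, and the number of runs are assumptions made for illustration:

```python
import random
from collections import Counter

def kmeans(points, k, iters=100, seed=None):
    """Basic k-means: random initial assignment, then iterate
    'compute cluster averages / reassign each gene to the nearest average'."""
    rng = random.Random(seed)
    assign = [rng.randrange(k) for _ in points]
    for _ in range(iters):
        means = []
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:
                means.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                means.append(rng.choice(points))  # re-seed an empty cluster
        new = [min(range(k),
                   key=lambda c: sum((x - m) ** 2 for x, m in zip(p, means[c])))
               for p in points]
        if new == assign:   # assignments stable: converged
            break
        assign = new
    return assign

# Hypothetical 1-D expression values forming two obvious groups.
points = [(1.0,), (1.2,), (0.9,), (8.0,), (8.3,), (7.9,)]

# Repeat with different random starts; count how often each pair co-clusters.
runs = 20
pair_counts = Counter()
for seed in range(runs):
    assign = kmeans(points, k=2, seed=seed)
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if assign[i] == assign[j]:
                pair_counts[(i, j)] += 1

# Consensus pairs: genes that cluster together in most runs.
consensus = {pair for pair, n in pair_counts.items() if n > runs / 2}
print(sorted(consensus))
```

On well-separated data like this, every run converges to the same split, so the consensus pairs are exactly the within-group pairs; on noisier data the voting step is what rescues a stable answer from the random starts.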
Nonetheless, even if you are able to reproduce the same result within