Biology Reference
In-Depth Information
based upon their cluster membership: the so-called guilt-by-association approach
( Altschuler et al. , 2000 ). Cluster membership has been widely used in this way to
predict protein function ( Sharan et al. , 2007; Mostafavi et al. , 2008 ).
Cluster analysis has been applied within microbiology to address a wide range of
questions. Perhaps the most common application is to the analysis of the composition
of microbial communities under natural conditions ( Noble et al. , 1997; Juck et al. ,
2000; Zhang and Fang, 2000; Blackwood et al. , 2003 ), when disturbed by agriculture
or pollution ( Rooney-Varga et al. , 1999; Juck et al. , 2000 ), and in disease ( Frank
et al. , 2007 ). Cluster analysis has also been used for the identification of
transcriptional modules ( Leyfer, 2005 ); for investigating the evolution of
pathogenicity ( Keim et al. , 2000; Tettelin et al. , 2005 ); and for gene
identification and protein classification ( Yooseph et al. , 2008 ).
Cluster analysis algorithms are not always designed to produce a neat set of
clusters with clearly defined membership. Hierarchical clustering methods, of which
there are many, generate a tree, or dendrogram, in which different levels of the tree
represent different granularities of the clustering. Data items can be assembled into a
tree by agglomeration , in which the items that are closest together, on the basis of the
distance metric chosen, are iteratively grouped together. Alternatively, a divisive
procedure can be used, with the entire dataset initially considered as a single cluster, and
then divided into successively smaller clusters on the basis of distance between the
cluster members or cluster centroids.
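The agglomerative procedure described above can be sketched in a few lines of Python. This is a minimal illustration only, not the method of any particular package: it assumes Euclidean distance and average linkage (the mean pairwise distance between two clusters), and the function names and the five toy points are invented for the example.

```python
from itertools import combinations
from math import dist  # Euclidean distance, Python 3.8+


def average_linkage(a, b, points):
    """Mean pairwise distance between two clusters given as tuples of indices."""
    return sum(dist(points[i], points[j]) for i in a for j in b) / (len(a) * len(b))


def agglomerate(points):
    """Repeatedly merge the two closest clusters; return the merge history."""
    clusters = [(i,) for i in range(len(points))]  # each item starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the pair of current clusters with the smallest linkage distance.
        a, b = min(
            combinations(clusters, 2),
            key=lambda pair: average_linkage(pair[0], pair[1], points),
        )
        height = average_linkage(a, b, points)
        clusters.remove(a)
        clusters.remove(b)
        clusters.append(tuple(sorted(a + b)))
        merges.append((a, b, height))  # one internal node of the dendrogram
    return merges


# Toy data (assumed for illustration): two tight pairs and one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0), (10.0, 0.0)]
merges = agglomerate(points)
for left, right, height in merges:
    print(left, "+", right, "merged at distance", round(height, 2))
```

Each entry in the merge history corresponds to one internal node of the dendrogram, and the recorded distance is the height at which that node would be drawn. The exhaustive search over all cluster pairs is chosen for readability and would be too slow for large datasets.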
A hierarchical tree terminates in a set of leaf nodes, each of which contains a
single member of the original dataset, making it valuable for examining the
relationships between individuals. However, a cluster tree can also be cut at
higher levels of the tree, giving a coarser granularity, to investigate
relationships between groups of individuals ( Figure 2.10 ). In many phylogenetic
trees (although not in Figure 2.10 ) the length of the vertical lines represents
evolutionary time since the last evolutionary split.
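The effect of cutting a tree at different heights can be illustrated with a short sketch. It relies on one specific property of single-linkage clustering, which is an assumption of this example rather than something stated in the text: cutting a single-linkage tree at height t gives the same groups as linking every pair of items whose distance is at most t and taking the connected components. The function name and the one-dimensional toy values are hypothetical.

```python
def cut_single_linkage(values, threshold):
    """Clusters obtained by cutting a single-linkage tree at `threshold`.

    Items are joined whenever their distance is <= threshold; the resulting
    connected components are the clusters at that level of the tree.
    """
    parent = list(range(len(values)))  # union-find forest, one root per item

    def find(i):
        # Follow parent links to the root, shortening the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if abs(values[i] - values[j]) <= threshold:
                parent[find(i)] = find(j)  # merge the two components

    groups = {}
    for i, value in enumerate(values):
        groups.setdefault(find(i), []).append(value)
    return sorted(groups.values())


# Toy one-dimensional measurements (assumed for illustration).
values = [1.0, 1.2, 1.5, 4.0, 4.3, 9.0]
fine = cut_single_linkage(values, 0.0)    # leaf level: every item alone
medium = cut_single_linkage(values, 0.5)  # small groups of near neighbours
coarse = cut_single_linkage(values, 3.0)  # groups of groups
print(len(fine), len(medium), len(coarse))
```

Raising the threshold moves the cut towards the root of the tree, so the same data yield six, three, and then two clusters.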
Probably the most familiar application of hierarchical clustering in microbiology
is for the construction of phylogenetic trees, which, we hope, reflect the evolutionary
relationships between organisms. Phylogenetic trees are usually built on the basis of
distances calculated between hypervariable regions of the genome. The most widely
used regions are the 16S rRNA genes, which have been used since the late 1970s
( Woese and Fox, 1977 ).
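As a rough illustration of how distances between sequences might be calculated before tree building, the sketch below computes the p-distance, the proportion of aligned positions at which two sequences differ. This is one simple choice among many and is not necessarily the measure used in the studies cited; the sequences are invented toy strings, not real 16S rRNA data, and they are assumed to be already aligned.

```python
def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned sequences.

    Positions where either sequence has a gap ("-") are ignored.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not compared:
        raise ValueError("no ungapped positions to compare")
    mismatches = sum(a != b for a, b in compared)
    return mismatches / len(compared)


# Hypothetical aligned fragments, for illustration only.
sequences = {
    "taxon_A": "ACGTACGTAC",
    "taxon_B": "ACGTACGTAT",
    "taxon_C": "ACGAACTTAT",
}

names = sorted(sequences)
# Pairwise distance matrix, stored as a dictionary keyed by name pairs.
distances = {
    (x, y): p_distance(sequences[x], sequences[y])
    for x in names
    for y in names
}
for x in names:
    print(x, [round(distances[(x, y)], 2) for y in names])
```

A matrix of this kind is the input to distance-based tree-building methods, including hierarchical clustering.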
Although the genes used for phylogenetic analysis have not changed for 30 years,
the technologies used to obtain genetic data are constantly changing and developing,
necessitating the development of new approaches to analysis. Most recently, the
advent of Next Generation sequencing has made possible the generation of huge
amounts of sequence data, quickly and cheaply, albeit in the form of short reads
of 100-200 bp. Such data requires new techniques for cluster analysis, which take
into account the nature of the data being analysed ( Huse et al. , 2010; Lemos
et al. , 2011; Foster et al. , 2012 ).
Another very widely used application of cluster analysis is the investigation of
time-course DNA microarray data ( Figure 2.11 ). The aim of many microarray