One may reconstruct a phylogenetic tree among distinct groups or species from
morphological data obtained by measuring and quantifying the phenotypic properties
of representative organisms via, for example, parsimony. However, recent phylogenetic
analysis uses nucleotide sequences of genes or amino acid sequences
of proteins as the basis for classification. Therefore in this chapter we focus on
phylogenetic tree reconstruction methods based on sequenced genes or genomic data
sets. Over evolutionary history, a character (e.g., a nucleotide) of a sequence
may be substituted by another, deleted, or inserted. Thus, before reconstructing a
phylogenetic tree, we have to align the input sequence data set, i.e., identify sequences
of characters (DNA bases, amino acids, etc.) that are thought to represent
the same (“homologous”) regions of genes (from different species or different gene
families, etc.) and then line up the sequences so that characters in different sequences
can be compared site by site, where sequences may differ at a site through mutation, i.e.,
insertions, deletions, or substitutions of characters. Such a line-up of two sequences
is an alignment; if we have multiple sequences then the result is properly called a
multiple alignment, although we will often abuse terminology and use “alignment”
for both. Aligning multiple sequences is generally known to be an NP-hard problem
[7, 8], and there are scores of approaches to creating alignments heuristically.
Here we assume that we have a perfectly aligned sequence data set and focus on the
reconstruction of phylogenetic trees from a multiple alignment.
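To make the site-by-site comparison concrete, here is a minimal Python sketch that walks down the columns of two aligned sequences and counts matches, substitutions, and indels. The function name `classify_sites`, the gap symbol "-", and the toy sequences are assumptions made for illustration and do not come from the chapter.

```python
from collections import Counter

GAP = "-"  # assumed gap symbol; many alignment formats use "-"


def classify_sites(seq_a: str, seq_b: str) -> Counter:
    """Count matches, substitutions and indels between two aligned sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    counts = Counter(match=0, substitution=0, indel=0)
    for a, b in zip(seq_a, seq_b):
        if a == GAP and b == GAP:
            continue  # a column of two gaps says nothing about this pair
        if a == GAP or b == GAP:
            counts["indel"] += 1  # insertion or deletion in one lineage
        elif a == b:
            counts["match"] += 1
        else:
            counts["substitution"] += 1
    return counts


# Toy pairwise alignment (made-up data)
row_a = "ACGT-ACGTA"
row_b = "ACCTTAC-TA"
print(classify_sites(row_a, row_b))
```

Note that a gap column cannot by itself say whether an insertion or a deletion occurred, which is why the sketch lumps both cases together as "indel".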
There are several methods to infer a phylogenetic tree from a given alignment,
including the maximum-likelihood (ML) method, distance-based methods,
parsimony-based methods, Bayesian inference methods, and so on (see [1] for more
details). In this chapter we will focus on distance-based methods.
Distance-based methods build phylogenetic trees from aligned sequences by making
use of pairwise “distances” between the sequences. These distances arise from a
model of sequence evolution, or evolutionary model, which encodes certain hypotheses
about how sequences evolve, often including the probabilities that one character
at a given site will change to another, as well as assumptions about how sequences
as a whole then transform, via evolution, into one another. Models of sequence evolution
are needed because observed sequences may have experienced
much more change over time than their present differences alone show. By way of analogy,
observing that a light switch is currently off does not by itself tell us
how many times it has previously been flipped on and off. Many
common models of sequence evolution are, in the lingo of mathematicians, given by
a continuous-time Markov model with a substitution rate matrix whose off-diagonal entries are
the instantaneous rates at which one character changes to another.
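As one illustration of how an evolutionary model turns observed differences into a distance, the sketch below uses the Jukes-Cantor (JC69) model, which assumes all nucleotide substitutions occur at the same rate. The chapter does not single out this model; it is chosen here only because its correction has a simple closed form, d = -(3/4) ln(1 - 4p/3), where p is the observed proportion of differing sites. The function names and the toy sequences are hypothetical.

```python
import math


def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Observed proportion of differing sites, ignoring columns with a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != gap and b != gap]
    if not pairs:
        raise ValueError("no gap-free sites to compare")
    return sum(a != b for a, b in pairs) / len(pairs)


def jc69_distance(p: float) -> float:
    """Expected substitutions per site under JC69 (an assumed model choice)."""
    if p == 0:
        return 0.0
    if p >= 0.75:
        return math.inf  # saturated: the correction is undefined
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# Toy aligned sequences (made-up data)
s1 = "ACGTACGTAC"
s2 = "ACGTTCGAAC"
p = p_distance(s1, s2)
print(p, jc69_distance(p))
```

The corrected distance is never smaller than the observed proportion: like the light switch, a site that looks unchanged or changed once may have been hit several times, and the model accounts for those hidden events.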
Each such evolutionary model given by a continuous-time Markov chain is associated
with a phylogenetic tree in which each node is a sequence and each (directed)
edge links an ancestral sequence to a descendant that evolved from it according to the
substitution rate matrix. In this way, the phylogenetic tree summarizes the relationships
between the species (or other organisms represented by the sequences) in terms of
common ancestors (internal nodes) and evolutionary changes, recorded as edge lengths (e.g., times
since divergence, number of substitution events, etc.).
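The link between edge lengths and pairwise distances can be sketched as follows: in a tree with edge lengths, the distance between two leaves is the sum of the edge lengths along the path joining them through their most recent common ancestor. The tree, the taxon names, and the edge lengths below are invented for illustration, and storing the tree as a child-to-parent dictionary is just one convenient assumption.

```python
# Hypothetical rooted tree: child -> (parent, length of the edge to the parent)
PARENT = {
    "human": ("anc1", 0.125),
    "chimp": ("anc1", 0.125),
    "anc1": ("root", 0.5),
    "mouse": ("root", 1.0),
}


def path_to_root(node: str) -> dict:
    """Map each ancestor of node (and node itself) to its distance from node."""
    dist = {node: 0.0}
    total = 0.0
    while node in PARENT:
        parent, length = PARENT[node]
        total += length
        dist[parent] = total
        node = parent
    return dist


def tree_distance(u: str, v: str) -> float:
    """Sum of edge lengths on the path between u and v."""
    du, dv = path_to_root(u), path_to_root(v)
    # With positive edge lengths, the shared ancestor minimizing the total
    # is the most recent common ancestor.
    return min(du[a] + dv[a] for a in du if a in dv)


print(tree_distance("human", "chimp"))
print(tree_distance("human", "mouse"))
```

Distance-based reconstruction runs in the opposite direction: it starts from pairwise distances estimated from the alignment and looks for a tree whose path lengths fit them.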
 