Pair hidden Markov models (Bioinformatics)

1. Introduction

Many of the early contributions of computer science to biological sequence analysis consisted of the development of algorithms for pairwise sequence alignment, most notably the Needleman-Wunsch algorithm (Needleman and Wunsch, 1970), which was later extended and refined by Smith and Waterman (Waterman and Smith, 1981), and others. It was only 20 years later, when probabilistic models began to be developed for related problems in sequence analysis, that it was first suggested that some of the classical algorithms might have a probabilistic interpretation (Bucher and Hofmann, 1996). This observation has led to numerous developments: more rigorous approaches to parameter estimation (Liu et al., 1995), novel annotation systems based on combinations of alignment and gene finding models (Alexandersson et al., 2003), and Bayesian sampling methods for alignments (Liu et al., 1995), to name a few.

The probabilistic interpretation of alignment is based on modeling sequences using pair hidden Markov models (PHMMs). Our goal in this chapter is to explain these models and their connection to hidden Markov models (HMMs) and phylogenetic models of evolution. These models are now widely used in computational biology, and a complete understanding of their foundations is essential both to their proper application and to the development of novel, more powerful models that will capture the complexities of sequence evolution. The chapter is organized as follows: we begin by explaining evolutionary models for single bases. This is the beginning of a rich theory, and although we avoid exploring the topic in depth, we point out the crucial nonidentifiability property for two taxa models. This property plays a key role in pairwise sequence alignment. Next, we describe HMMs, which capture positional dependence in biological sequences and are therefore relevant for PHMMs. We then turn to a complete description of PHMMs, which are instances of Bayesian multinets. These are graphical models in which the structure of the graph can depend on the assignments to nodes. This distinction places PHMMs in a separate category from standard HMMs. In fact, the term PHMM suggests that PHMMs are close relatives of HMMs, which is misleading; PHMMs are significantly more complex than HMMs. For biologists, the relevance of PHMMs stems from the fact that they can be used to infer alignments of sequences. The standard alignment algorithms, such as the Needleman-Wunsch algorithm and its variants, can be interpreted as Viterbi algorithms for PHMMs. This is explained in Section 4, in which we also show that PHMMs can be used to sample alignments efficiently. We conclude by describing biologically relevant extensions of PHMMs.


2. What is an alignment?

PHMMs are mainly used to find alignments between pairs of sequences, so we begin by explaining exactly what we mean by pairwise alignment.

Definition 1. A pairwise alignment of two sequences a1 and a2 over a finite alphabet is a string over the alphabet {H, I, D} with the property that #H + #I = |a1| and #H + #D = |a2|. By #H, #I, and #D we mean the number of occurrences of those characters in the alignment string.

An alignment captures the relationship between two sequences that have evolved from a common ancestor via a series of mutations of bases, represented by H (which stands for homology), insertions (I), and deletions (D). The alignment string records the events that led to the present-day sequences. Alternatively, an alignment can be viewed as a record of edits that transform a1 into a2: characters of a1 may be changed or deleted, and new characters may be inserted. We make this statement precise in the following section.
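To make Definition 1 concrete, here is a minimal Python sketch (our own illustration, not part of the original text; the helper names and example sequences are invented) that checks the two counting conditions and renders the two gapped rows an alignment string encodes:

```python
def is_alignment(h, a1, a2):
    """Definition 1: #H + #I == |a1| and #H + #D == |a2|."""
    return (h.count("H") + h.count("I") == len(a1)
            and h.count("H") + h.count("D") == len(a2))

def render(h, a1, a2):
    """Show the two gapped rows that an alignment string encodes."""
    row1, row2, i, j = [], [], 0, 0
    for c in h:
        if c in "HI":                 # H and I consume a base of a1
            row1.append(a1[i]); i += 1
        else:
            row1.append("-")
        if c in "HD":                 # H and D consume a base of a2
            row2.append(a2[j]); j += 1
        else:
            row2.append("-")
    return "".join(row1), "".join(row2)

a1, a2 = "ACGT", "AGT"
h = "HIHH"
assert is_alignment(h, a1, a2)
print(render(h, a1, a2))              # ('ACGT', 'A-GT')
```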

The main problem in sequence alignment is to identify good alignments that explain the evolutionary history on the basis of parameters for mutation, insertion, and deletion. Mutation mechanisms are accounted for in PHMMs with the use of evolutionary models. An underlying HMM structure is used to model local sequence characteristics, such as insertion/deletion structure. The challenging computational task is to find good alignments among the set of all possible alignments. In particular, if A_{n,m} is the set of all alignments of two strings of lengths n and m, the number of elements of A_{n,m} is the Delannoy number

$$|\mathcal{A}_{n,m}| = \sum_{k=0}^{\min(n,m)} 2^k \binom{n}{k} \binom{m}{k}.$$

This number is exponential in the lengths of the sequences, leading to a nontrivial computational problem in identifying good alignments.
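A quick way to appreciate this growth is to evaluate the Delannoy numbers directly. The following sketch (our own, assuming the standard three-term recurrence that mirrors the fact that an alignment must end in H, I, or D) is illustrative:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def delannoy(n, m):
    """Count alignments of sequences of lengths n and m: every alignment
    ends in H, I, or D, which gives the Delannoy recurrence."""
    if n == 0 or m == 0:
        return 1                      # only Is (or only Ds) remain
    return delannoy(n - 1, m - 1) + delannoy(n - 1, m) + delannoy(n, m - 1)

print(delannoy(5, 5))                 # 1683
print(delannoy(10, 10))               # 8097453
print(len(str(delannoy(100, 100))))   # dozens of digits: enumeration is hopeless
```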

3. Models

3.1. Evolutionary models for two bases

PHMMs capture point mutations, as well as insertions and deletions in biological sequences. We begin by carefully analyzing single base evolutionary models that are at the core of PHMMs. We restrict ourselves to DNA. Single base mutation is modeled by continuous time Markov chains:

Definition 2. A rate matrix (or Q-matrix) is a 4 × 4 matrix Q = (q_ij) (with rows and columns indexed by the nucleotides) satisfying the properties

$$q_{ij} \geq 0 \ \text{ for all } i \neq j, \qquad \sum_{j} q_{ij} = 0 \ \text{ for all } i.$$

Rate matrices capture the notion of an instantaneous rate of mutation and allow us to calculate the substitution matrices P(t), which give the probability that a given base will mutate from one nucleotide to another over a given period of time. The entry of P(t) in row i and column j equals the probability that the substitution i → j occurs in a time interval of length t. We use the notation P(i, j; t) to denote such an entry. Given a rate matrix Q, these probabilities are governed by a matrix of differential equations whose solution is provided by the following theorem:

Theorem 3. Let Q be a rate matrix and define

$$P(t) = e^{Qt} = \sum_{k=0}^{\infty} \frac{t^k Q^k}{k!}.$$

Then:

1. P(s + t) = P(s)P(t),

2. P(t) is the unique solution to P'(t) = P(t) · Q, P(0) = I, for t ≥ 0,

3. P(t) is the unique solution to P'(t) = Q · P(t), P(0) = I, for t ≥ 0.

Furthermore, a matrix Q is a rate matrix if and only if the matrix P(t) = e^{Qt} is a stochastic matrix (nonnegative with row sums equal to one) for every t ≥ 0.

By fixing the entries in a rate matrix and specifying some time t*, we obtain a matrix P(t*) that completely specifies the probabilities for observing each of the four nucleotides after time t*, starting from each of the nucleotides A, C, G, or T. One key feature of the evolutionary models we consider is reversibility. In the case in which the background base frequencies are uniform, this property is equivalent to requiring that the substitution matrices be symmetric.
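As a sketch of how a substitution matrix is computed in practice, the fragment below builds a Jukes-Cantor rate matrix (our choice of example; any valid Q works) and evaluates P(t) = e^{Qt} with scipy, checking the stochasticity and symmetry properties just discussed:

```python
import numpy as np
from scipy.linalg import expm

# Jukes-Cantor rate matrix: all off-diagonal rates equal to alpha,
# diagonal chosen so that rows sum to zero.
alpha = 0.1
Q = alpha * (np.ones((4, 4)) - 4 * np.eye(4))

t = 0.5
P = expm(Q * t)                            # substitution matrix P(t) = e^{Qt}

assert np.all(P >= 0)                      # stochastic: nonnegative entries,
assert np.allclose(P.sum(axis=1), 1)       # with unit row sums
assert np.allclose(P, P.T)                 # symmetric, hence reversible with
                                           # respect to the uniform distribution
# Theorem 3, part 1: P(s + t) = P(s) P(t)
assert np.allclose(expm(Q * 0.2) @ expm(Q * 0.3), expm(Q * 0.5))
```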

Substitution matrices can be combined along the edges of a tree to form a phylogenetic model. We consider only the simplest case, consisting of an ancestral base r which, after a speciation event, mutates into a base a that we observe after time t1 and a separate base b that we observe after time t2. If P(i, j; t1, t2) = Pr(a = i, b = j; t1, t2) is the probability of observing i and j at a and b after times t1 and t2 respectively, then we can write the probability

$$P(i, j; t_1, t_2) = \frac{1}{4} \sum_{k \in \{A, C, G, T\}} P(k, i; t_1)\, P(k, j; t_2),$$

where, again, we have assumed a uniform distribution at r. If we include the assumption that P(t) is symmetric for all t, then this completely specifies a statistical model that we call the general reversible model for two taxa. The key property of reversible models that we will need for PHMMs is the nonidentifiability of the two taxa general reversible model:

Proposition 4. In the general reversible model for two taxa, if t1 + t2 = t3 + t4, then P(i, j; t1, t2) = P(i, j; t3, t4) for all bases i and j. That is, only the sum of the branch lengths is identifiable.

Proof: Since P(t) is symmetric for all t, P(i, j; t) = P(j, i; t) for all i and j. Now, as above, we have that

$$P(i, j; t_1, t_2) = \frac{1}{4} \sum_{k} P(k, i; t_1)\, P(k, j; t_2) = \frac{1}{4} \sum_{k} P(i, k; t_1)\, P(k, j; t_2). \qquad (4)$$

However, the right-hand side of equation (4) is just 1/4 times the (i, j)th entry of the matrix P(t1)P(t2) = P(t1 + t2). Similarly, P(i, j; t3, t4) is 1/4 times the (i, j)th entry of the matrix P(t3 + t4). The result now follows from the assumption that t1 + t2 = t3 + t4.

This proposition shows that we arrive at the same distribution on pairs of nucleotides if we set one of the branch lengths to 0, in which case the model describes not two bases evolving from a common ancestor but rather one base changing directly into the other. This fact lies at the heart of our ability to refer to alignment as the editing of one sequence into the other, rather than as a description of the evolutionary history from a common ancestor.
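Proposition 4 is easy to confirm numerically. The sketch below (again our own, reusing the hypothetical Jukes-Cantor Q from the earlier fragment) verifies that the joint distribution depends only on t1 + t2:

```python
import numpy as np
from scipy.linalg import expm

alpha = 0.1
Q = alpha * (np.ones((4, 4)) - 4 * np.eye(4))   # Jukes-Cantor Q, as before

def pair_dist(t1, t2):
    """Joint distribution P(i, j; t1, t2) with a uniform ancestral base:
    (1/4) * sum_k P(k, i; t1) P(k, j; t2) = (1/4) * (P(t1)^T P(t2))_{ij}."""
    return 0.25 * expm(Q * t1).T @ expm(Q * t2)

# Branch lengths (0.3, 0.7), (0.5, 0.5), and (1.0, 0.0) give the same
# distribution, since all three pairs sum to 1.0.
assert np.allclose(pair_dist(0.3, 0.7), pair_dist(0.5, 0.5))
assert np.allclose(pair_dist(0.3, 0.7), pair_dist(1.0, 0.0))
```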

3.2. Phylogenetic models for two sequences

In order to describe the evolution of entire sequences, rather than single bases, we make use of HMMs. In this section, we use them to describe an evolutionary model for sequences, which allows for point mutation and dependence between neighboring nucleotides, without allowing for insertion or deletion, a complexity that is added in the next section.

Definition 5. A discrete hidden Markov model consists of n observed random variables Y1, …, Yn taking on l possible values and n hidden random variables X1, …, Xn taking on k possible values. The HMM can be characterized by the following conditional independence statements for i = 1, …, n:

$$\Pr(X_i \mid X_1, \ldots, X_{i-1}, Y_1, \ldots, Y_{i-1}) = \Pr(X_i \mid X_{i-1}),$$

$$\Pr(Y_i \mid X_1, \ldots, X_i, Y_1, \ldots, Y_{i-1}) = \Pr(Y_i \mid X_i).$$

We consider the model with uniform initial distribution, where all probabilities p(X_i | X_{i-1}) are given by the same k × k matrix and all probabilities p(Y_i | X_i) are given by the same l × k matrix.

The relationship of HMMs to the modeling of pairs of sequences that evolved from a common ancestral sequence is as follows: the random variables Y1, …, Yn take on l = 16 possible values (consisting of all pairs of nucleotides). The hidden random variables X1, …, Xn take on k = 4 possible states corresponding to the four nucleotides A, C, G, and T. The hidden values play the role of the ancestral sequence. Fixing the conditional transition and output probabilities in the HMM results in a probability distribution on pairs of sequences. If the output probabilities are constrained to a reversible model, it is clear from the previous discussion that we have described an evolutionary model for sequences that allows for dependence between neighboring nucleotides in the ancestral sequence.
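As an illustration of this construction, the fragment below (our sketch; Q, t1, and t2 are arbitrary choices) assembles the 16-valued output distribution of a hidden ancestral base from two substitution matrices:

```python
import numpy as np
from scipy.linalg import expm

alpha = 0.1
Q = alpha * (np.ones((4, 4)) - 4 * np.eye(4))   # Jukes-Cantor Q, as before

# Output distribution of a hidden ancestral base k over observed pairs (i, j):
# p(i, j | k) = P(k, i; t1) * P(k, j; t2), giving the l = 16 observed values
# of Definition 5.
t1, t2 = 0.4, 0.6
P1, P2 = expm(Q * t1), expm(Q * t2)
emit = np.einsum("ki,kj->kij", P1, P2)          # shape (4, 4, 4): k -> (i, j)
assert np.allclose(emit.reshape(4, 16).sum(axis=1), 1)  # each row is a distribution
```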

3.3. Modeling insertion and deletion

We are now ready to introduce PHMMs, which add insertions and deletions into the model in Section 3.2. Graphically, this model is represented in Figure 1(a). The boxes around the nodes, called plates, are used to indicate that the nodes may not be present. The presence of a node is determined by class nodes, small circles in the plates that in our case take on the values 0 (node missing) or 1 (node present). The hidden states take on the values H, I1, I2 and D1, D2. In the H state, an ancestral base is generated, and from it two observed bases. In the D states, an ancestral base is generated but only one observed base. In the I states, no ancestral base is generated, and one observed base is generated.

Alternatively (and equivalently, because of the nonidentifiability discussed in Section 3.1), we can describe a PHMM with the model in Figure 1(b). In this case, there are three hidden states: H, I, and D. In the I state a base is generated in sequence 1, and in the D state a base is generated in sequence 2. In the H state, two bases are generated from a joint distribution (note that the model as drawn technically requires four H states, one for each base, but this is usually omitted in favor of simply writing the joint probability distribution on the observed nodes directly). The model then consists of a distribution {π_i}, i = 1, …, k, on the initial state, transition probabilities s_ij = Pr(X_{l+1} = j | X_l = i), and an output distribution for each state j, b_j(x, y) = Pr(x, y | X_l = j), where x and y range over all possible pairings of bases, allowing for the null base. The fact that the output distribution now includes null characters accounts for the instances where one or the other observed node is not present in the model. For example, b_I(A, 0) is the probability of an A being observed at the first node when the second node is not present in the model.

Given two sequences a1 and a2 of lengths N and M respectively, this model allows us to assign a probability to any alignment h in A_{N,M}. The alignment is a sequence of Hs, Is, and Ds, and this will be our sequence of hidden states. Adopting the notation of Pachter and Sturmfels (2005), we define h[i] to be #H + #I in the prefix h1 h2 … hi of h, and h⟨i⟩ to be #H + #D in h1 h2 … hi. Thus, h[i] is our position in a1 at step i of the alignment and h⟨i⟩ is our position in a2. The probability of the alignment h according to our model is then

$$\Pr(a^1, a^2, h) = \pi_{h_1}\, b_{h_1}\!\left(a^1_{h[1]}, a^2_{h\langle 1 \rangle}\right) \prod_{i=2}^{|h|} s_{h_{i-1} h_i}\, b_{h_i}\!\left(a^1_{h[i]}, a^2_{h\langle i \rangle}\right),$$

where, by convention, the first argument of b_{h_i} is the null character 0 when h_i = D and the second argument is 0 when h_i = I.

Figure 1 The pair hidden Markov model

From this we can calculate the probability of two sequences:

$$\Pr(a^1, a^2) = \sum_{h \in \mathcal{A}_{N,M}} \Pr(a^1, a^2, h).$$
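The following sketch implements these two formulas for a toy three-state PHMM. All parameter values are invented for illustration, the Begin/End states are omitted for brevity, and the brute-force marginal is feasible only for very short sequences:

```python
import itertools

states = "HID"
pi = {"H": 0.8, "I": 0.1, "D": 0.1}                  # initial distribution
s = {(u, v): 0.8 if u == v else 0.1                  # transitions s_uv
     for u in states for v in states}

def b(state, x, y):
    """Output distribution b_state(x, y); '0' is the null character."""
    if state == "H":
        return 0.19 if x == y else 0.02              # 4*0.19 + 12*0.02 = 1
    return 0.25                                      # I emits (x, 0); D emits (0, y)

def prob(h, a1, a2):
    """Pr(a1, a2, h): initial prob, then transition * output at each step."""
    p, i, j = 1.0, 0, 0
    for k, c in enumerate(h):
        x = a1[i] if c in "HI" else "0"
        y = a2[j] if c in "HD" else "0"
        p *= (pi[c] if k == 0 else s[(h[k - 1], c)]) * b(c, x, y)
        if c in "HI": i += 1
        if c in "HD": j += 1
    return p

def marginal(a1, a2):
    """Pr(a1, a2): brute-force sum over all alignment strings (tiny inputs only)."""
    n, m = len(a1), len(a2)
    return sum(prob(h, a1, a2)
               for L in range(max(n, m), n + m + 1)
               for h in itertools.product(states, repeat=L)
               if h.count("H") + h.count("I") == n
               and h.count("H") + h.count("D") == m)

print(prob("HIHH", "ACGT", "AGT"), marginal("AC", "AG"))
```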

Despite the name, PHMMs are nontrivial modifications of standard HMMs, in that the structure of the model is not fixed. In the graphical model literature, these types of models are referred to as Bayesian multinets (Friedman et al., 1997).

4. Algorithms

4.1. Finding a best alignment

The model described in the previous section, for some fixed parameters and fixed observed sequences, provides a probability distribution over alignments. Thus, for a given pair of sequences, one obvious question is: what is the most likely alignment of the two sequences according to the model? Stated formally, the question is to find the maximum a posteriori (MAP) alignment. The naive method of answering this question would be to enumerate all possible alignments, calculate the probability of each, and select the one with highest probability. This quickly becomes infeasible as the number of possible alignments of two sequences grows exponentially in their length. However, by exploiting the structure of the model, we can find the optimal alignment in time proportional to the product of the lengths of the sequences using the Viterbi algorithm.

The Viterbi algorithm for PHMMs is related to the Viterbi algorithm for HMMs, which is itself a specific instance of the generalized distributive law for inference in graphical models (Aji and McEliece, 2000). It can also be viewed as an instance of dynamic programming, a general method for exploiting recursive structure in optimization problems. As is typical of dynamic programming methods, the Viterbi algorithm solves what would appear to be a much more difficult problem: instead of finding the optimal alignment for the two given sequences, the optimal alignment is calculated for every pair of prefixes of the sequences. For a sequence a, let a(i) denote the prefix a1 a2 … ai, and denote the two sequences being compared by a1 and a2. The key observation of the Viterbi algorithm is that in an alignment of a1(i) and a2(j), the final hidden state must be either H, I, or D, and so the alignment decomposes naturally into smaller alignments. This is used to recursively compute a matrix δ, where δ(i, j, s) is the probability of the optimal alignment of a1(i) and a2(j) ending in hidden state s. Having computed this matrix, it is then possible to compute the optimal alignment of the entire sequences.

Formally, let X1 X2 … XL be the underlying hidden Markov process assuming values in the state space S = {H, I, D} (where L ≤ N + M is the length of the alignment). The process begins in a state determined by the initial distribution {π_i} and evolves through the state space according to the transition probabilities s_ij = Pr(X_{l+1} = j | X_l = i). The observed sequences a1 and a2 are random outputs of the hidden process, generated according to the output distributions b_j(a^1_n, a^2_m) = Pr(a^1_n, a^2_m | X_l = j). Moreover, in order for the model to represent a distribution over all possible pairs of sequences and their alignments, we need to introduce two silent states, Begin and End. We initialize the Viterbi algorithm by

$$\delta(0, 0, \mathrm{Begin}) = 1,$$

$$\delta(n, m, H) = \max_i \left\{ \delta(n-1, m-1, i)\, s_{iH} \right\} b_H(a^1_n, a^2_m),$$

$$\delta(n, m, I) = \max_i \left\{ \delta(n-1, m, i)\, s_{iI} \right\} b_I(a^1_n, 0),$$

$$\delta(n, m, D) = \max_i \left\{ \delta(n, m-1, i)\, s_{iD} \right\} b_D(0, a^2_m),$$

where i ranges over S ∪ {Begin}, 0 represents a gap, and δ(n, m, j) = 0 if n < 0 or m < 0. The recursion is terminated in the End state by δ(N + 1, M + 1, End) = max_j {δ(N, M, j) s_{j End}}, which is the probability of the most likely state sequence; call the maximizing state i*. The optimal hidden state sequence, given the data, is retrieved in the traceback by starting at i* and recursively backtracking through argmax_i δ(n − k, m − l, i), where (k, l) ∈ {(1, 1), (1, 0), (0, 1)} according to whether the state at each step is H, I, or D.
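A compact log-space implementation of this recursion might look as follows. It is a sketch, not the original authors' code: the Begin/End bookkeeping is simplified away, states are indexed H = 0, I = 1, D = 2, and the parameters are the hypothetical toy values used earlier:

```python
import math
import numpy as np

def viterbi_phmm(a1, a2, pi, s, b):
    """MAP alignment for a three-state PHMM, computed in log space.
    pi: length-3 array of initial log-probs, s: 3x3 array of log transition
    probs, b(state, x, y): log output prob, with '0' as the gap character."""
    N, M = len(a1), len(a2)
    delta = np.full((N + 1, M + 1, 3), -np.inf)
    moves = [(1, 1), (1, 0), (0, 1)]          # H, I, D consume (da1, da2)
    for n in range(N + 1):
        for m in range(M + 1):
            for st, (dn, dm) in enumerate(moves):
                pn, pm = n - dn, m - dm
                if pn < 0 or pm < 0:
                    continue
                x = a1[pn] if dn else "0"
                y = a2[pm] if dm else "0"
                if pn == 0 and pm == 0:
                    best = pi[st]             # first step leaves Begin
                else:
                    best = max(delta[pn, pm, j] + s[j, st] for j in range(3))
                delta[n, m, st] = best + b(st, x, y)
    # traceback from the best final state
    cur = int(np.argmax(delta[N, M]))
    n, m, path = N, M, []
    while True:
        path.append("HID"[cur])
        dn, dm = moves[cur]
        n, m = n - dn, m - dm
        if n == 0 and m == 0:
            break
        cur = int(np.argmax([delta[n, m, j] + s[j, cur] for j in range(3)]))
    return "".join(reversed(path)), float(delta[N, M].max())

pi = np.log([0.8, 0.1, 0.1])
s = np.log(np.full((3, 3), 0.1) + 0.7 * np.eye(3))
def b(st, x, y):                              # toy emissions; H is state 0
    return math.log(0.19 if x == y else 0.02) if st == 0 else math.log(0.25)

print(viterbi_phmm("ACGT", "AGT", pi, s, b))  # e.g. ('HIHH', ...)
```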

Needleman and Wunsch were the first to describe an alignment algorithm for two biological sequences using the dynamic programming approach (Needleman and Wunsch, 1970), and although their paper is focused on a very specific problem, it contains the essence of many alignment methods in use today. It is easy to see that by taking logarithms of the parameters we have described, the PHMM Viterbi algorithm becomes a Needleman-Wunsch algorithm with negative scores. If alignments are normalized, some of the scores will be positive (Bucher and Hofmann, 1996). It is also true that for any set of Needleman-Wunsch scores, a PHMM can be constructed for which the Viterbi algorithm will yield the same alignment as the Needleman-Wunsch algorithm (even for extensions such as affine gap penalties).

There have been countless improvements to the basic algorithm; we point the reader to textbooks such as Durbin et al. (1998) and Gusfield (1997) for detailed expositions. One point of note is that there is a linear-space divide-and-conquer algorithm for pairwise sequence alignment that is crucial for practical applications where memory is a limiting factor in computing large sequence alignments (Hirschberg, 1975).

4.2. Sampling alignments

One of the advantages of using PHMMs for alignment is that operations such as sampling of alignments can be performed in a statistically meaningful way, for example, by sampling from the posterior distribution on the alignments given the observations. This is not only useful for exploring alternatives to the MAP alignment discussed in the previous section but also for utilizing Bayesian methods for parameter estimation. Fortunately, there is an efficient way to sample alignments, which is also memory efficient (Cawley and Pachter, 2003). The algorithm is not well known, and we therefore include it here for completeness.

Consider the alignment of two sequences of lengths N and M, with a simple PHMM that assigns a single probability w to each inserted or deleted base and substitution probabilities s_ij (for observing nucleotides i and j aligned). Let a_{ij} denote the sum of the probabilities of all alignments in which position i in one sequence is aligned to position j in the other (i.e., the so-called forward variables). Observe that the a_{ij} can be computed using a standard Viterbi-type algorithm with max replaced by sum. The recursion is

$$a_{ij} = s_{a^1_i a^2_j}\, a_{i-1, j-1} + w\, a_{i-1, j} + w\, a_{i, j-1}, \qquad (1)$$

where i, j ≥ 1 and a_{0j} = a_{j0} = w^j. Let v_j = (a_{0j}, a_{1j}, …, a_{Nj}) be the jth column vector (0 ≤ j ≤ M). The notation v_j(i) is used to denote a_{ij}. We fix a constant r and write v_{r+1} in terms of v_r. First, consider the lower-triangular (N + 1) × (N + 1) matrix W whose ijth entry is w^{i−j+1} for i ≥ j, with W[i, j] = 0 otherwise. Similarly, let S_r be the (N + 1) × (N + 1) matrix with 1 on the diagonal, S_r[i, j] = s_{a^1_i a^2_r}/w for i = j + 1, and S_r[i, j] = 0 otherwise. Observe that recursion (1) can be rewritten as

$$v_{r+1} = W\, S_{r+1}\, v_r. \qquad (2)$$

In order to sample efficiently, we want to compute v_{r−1} in terms of v_r:

$$v_{r-1} = S_r^{-1}\, W^{-1}\, v_r. \qquad (3)$$

The matrix W is invertible iff w ≠ 0, which is always the case since w = e^g for some real number g. All the matrices S_r are invertible. Thus, we can solve for v_{r−1} given v_r. Matrix inversion and multiplication are expensive in general; however, in our case the equations yield an efficient recursive procedure for computing the entries of v_{r−1} because of the special structure of the matrices. It is easily seen that W^{−1} is a banded matrix with nonzero entries only on the main diagonal and the first subdiagonal, and S_r^{−1} is a lower-triangular matrix. Multiplying them out, we obtain the following equations for computing v_{r−1} from v_r:

$$v_{r-1}(0) = \frac{1}{w}\, v_r(0), \qquad (4)$$

$$v_{r-1}(i) = \frac{1}{w}\, v_r(i) - v_r(i-1) - \frac{s_{a^1_i a^2_r}}{w}\, v_{r-1}(i-1), \qquad 1 \leq i \leq N. \qquad (5)$$

Thus, the vector v_{r−1} can be computed in time O(N) from the vector v_r by computing its entries in the order v_{r−1}(0), v_{r−1}(1), …, v_{r−1}(N). Although it is possible to derive recursions (4) and (5) directly from equation (1), we have included the matrix derivation since it is useful in generalizing the backward computation of forward variables to other problems.
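The sketch below implements recursion (1) and the backward step (4)-(5), and checks that recomputing a column from its successor reproduces the forward computation. The function names (forward_columns, previous_column) and the values of w and the substitution probabilities are our own illustrative choices:

```python
import numpy as np

def forward_columns(a1, a2, w, sub):
    """Forward variables from recursion (1), with boundary a_{0j} = w**j and
    a_{i0} = w**i. Returns the full table for checking; the sampler itself
    keeps only one column at a time."""
    N, M = len(a1), len(a2)
    a = np.zeros((N + 1, M + 1))
    a[:, 0] = w ** np.arange(N + 1)
    a[0, :] = w ** np.arange(M + 1)
    for i in range(1, N + 1):
        for j in range(1, M + 1):
            a[i, j] = (sub(a1[i - 1], a2[j - 1]) * a[i - 1, j - 1]
                       + w * a[i - 1, j] + w * a[i, j - 1])
    return a

def previous_column(v_r, r, a1, a2, w, sub):
    """Recover v_{r-1} from v_r via equations (4) and (5)."""
    N = len(a1)
    prev = np.zeros(N + 1)
    prev[0] = v_r[0] / w
    for i in range(1, N + 1):
        s_i = sub(a1[i - 1], a2[r - 1])    # substitution against position r of a2
        prev[i] = v_r[i] / w - v_r[i - 1] - (s_i / w) * prev[i - 1]
    return prev

w = 0.1
sub = lambda x, y: 0.6 if x == y else 0.1
a = forward_columns("ACGT", "AGT", w, sub)
v2 = previous_column(a[:, 3], 3, "ACGT", "AGT", w, sub)
assert np.allclose(v2, a[:, 2])            # backward step reproduces column 2
```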

The sampling algorithm for randomly selecting an alignment path according to its weight is therefore to first compute v_M using the forward-algorithm variation of the Needleman-Wunsch algorithm (max replaced by sum). This is done using only O(N) memory by keeping only v_{r−1} in memory during the computation of v_r. The samples can then be computed simultaneously by computing v_{M−1}, …, v_0 one column at a time and always discarding columns after they have been used (to keep the computation within O(N) memory). The running time of the algorithm for generating k samples is therefore O(NM + k(N + M)). The memory requirement, like that of the Hirschberg algorithm (Hirschberg, 1975), is only O(N + M).

Although we have presented the algorithm for a simple PHMM, the same method works, with minor modifications, for more general models.

5. Extensions

5.1. Generalized pair hidden Markov models and alignment

Traditionally, the problems of gene finding and alignment have been treated separately. However, genes and other functional elements in the genome tend to be conserved between related organisms to a higher extent than nonfunctional regions, and so strong alignments are an indicator of possible function for the sequences in the aligned regions. Conversely, inferences about sequence function help construct biologically meaningful alignments. Thus, combining the two holds promise to improve both. At a theoretical level, combining gene prediction with alignment is quite simple: the hidden states of the PHMM are modified to reflect not only the evolutionary history but also the functionality of the bases, and the output states are modified depending on the functional type as well. So, for example, a hidden state might correspond to matching two codons, or an output state could correspond to the specific patterns found in splice sites. However, there are serious practical difficulties with this approach. In a PHMM, the number of steps spent in any given hidden state will be geometrically distributed, and so, in the case of gene finding, one would expect the lengths of the predicted exons, unlike the lengths of real exons, to have a geometric distribution. This requires using what is referred to as a generalized PHMM. The development of generalized PHMMs was undertaken in Pachter et al. (2002). The theory was implemented in SLAM (Alexandersson et al., 2003), a program that aligns and identifies complete exon/intron structures in two related DNA sequences.

5.2. PhyloHMMs and multiple sequence alignment

The hidden Markov model described in Section 3.2 for sequence evolution without gaps can be generalized to multiple sequences related via a phylogeny (McAuliffe, 2004; Siepel and Haussler, 2003). These phyloHMMs can also include generalized states, although this introduces complexities when one allows dependencies between neighboring nucleotides in ancestral sequences. In particular, it may no longer be possible to perform efficient MAP inference. In a different direction, multiple sequence alignment that allows insertions and deletions in ancestral sequences (without generalized states) leads to interesting models studied by Mitchison (1999), Bruno and Holmes (2001), and others. An interesting future direction for research is the combination of generalized ancestral states with models allowing for insertion and deletion. In other words, it should be interesting to explore generalized phylogenetic Markov Bayesian multinets.
