Dynamic programming for gene finders (Bioinformatics)

1. Introduction

Dynamic programming (DP), a technique commonly used in bioinformatics algorithms to reduce the evaluation time for recurrence relations, is especially prevalent in the implementation of gene-finding software. The task of gene finding has traditionally been formulated as that of choosing a single parse, or collection of zero or more nonoverlapping gene models, of a sequence such that the chosen parse maximizes some objective function. In general, the number of such parses can grow exponentially with the length of the sequence, owing to the combinatorial nature of the parses – that is, n nonoverlapping and frame-consistent exons can, in theory, be combined independently to produce 2n different gene models. Thus, the explicit enumeration and evaluation of all possible parses of a sequence is generally too computationally expensive to be undertaken in practice. However, many of the objective functions favored by practitioners, particularly those that are probabilistic, can be expressed as a recurrence relation in which the value of a parse is defined recursively in terms of the values of subparses. It is this class of formulations that are amenable to dynamic programming, or DP.

There are two main approaches to DP: bottom-up and top-down. The latter, also called memoization, simply caches the value returned by a recursive call the first time it is made, so that subsequent calls with the same parameters can retrieve the value from the cache rather than recomputing it from scratch. By contrast, the bottom-up approach, which is more commonly used in gene finding and is slightly more efficient, involves the proactive computation of simpler subproblems before they are explicitly needed for the evaluation of more complex subproblems. The values of these subproblems are stored in an array or matrix, with the evaluation order of the cells in the matrix being determined by the recursive structure of the recurrence relation. Once the matrix has been evaluated, the optimal value can be extracted from the appropriate cell in the matrix, and the corresponding optimal solution can be obtained by tracing back through the recurrence links that were used to compute the value of each cell (assuming such links were explicitly stored or can be reconstructed during trace-back) (Cormen etal., 2001; Durbin etal., 1998).


Thus, the main tasks in devising a DP implementation of a new gene-finding strategy are (1) to formulate the objective function as a recurrence relation, (2) to map the recursive structure of that recurrence relation to the elements of a multidimensional matrix, and (3) to determine the evaluation order of the cells in that matrix. Optimization of the objective function is then achieved by evaluating all the cells of the DP matrix and then identifying the optimal cell.

2. Example uses

Examples of the actual use of DP for gene finding are legion, particularly in Marko-vian systems. It is standard to employ DP in both the training and application of Hidden Markov Models (HMMs) (see Article 98, Hidden Markov models and neural networks, Volume 8). The Forward, Backward, and Baum-Welch algorithms for training an HMM involve matrices wherein each cell corresponds to a conditional or joint probability on sequences and their emission by particular states. The Viterbi decoding algorithm for parsing a sequence with an HMM likewise involves a matrix with cells denoting conditionally optimal paths through the states of the HMM and their corresponding conditional probabilities. The global optimum is identified by the cell that pairs the HMM’s designated final state with the end of the sequence (Durbin et al., 1998).

Other uses of DP for gene finding include the decoding algorithms used in applying Generalized Hidden Markov Models (GHMMs) and Nonstationary Markov Chains, and the procedures used to train and apply neural networks (Xu and Uberbacher, 1998; Rumelhart etal., 1986).

3. Practical considerations

Various practical considerations can complicate the task of devising and implementing a DP algorithm. Among the most prominent are those involving space/speed trade-offs. The space requirements of multidimensional DP matrices can grow very quickly as the lengths of the dimensions increase, but often it is possible to eliminate portions of the matrix from consideration right at the outset, thereby permitting the adoption of a “sparse matrix” representation in which memory is allocated only for the portions of the matrix actually needed. Such a representation may or may not improve speed and memory efficiency, but it can also require significantly more effort during the development and maintenance of the software.

Even if the complete matrix is fully allocated in memory, it may still be advantageous to eliminate from consideration certain cells in the matrix that are unlikely to contribute to the optimal solution, similar to “banded” alignment techniques. When the probabilities of these eliminated cells are small but nonzero, the resulting procedure becomes a heuristic that is no longer guaranteed to find the optimal solution. In many cases this might be acceptable, but doing this typically involves empirical studies to establish an acceptable range of values for the cells being eliminated. This may need to be done every time the gene finder is trained for a new organism.

Devising the original recurrence relations can itself be a difficult task (Sankoff, 2000). For any given set of proposed equations, a worthwhile exercise is the derivation of a rigorous proof that the equations exhibit the intended semantics and do not permit greedy behavior. Unintended greediness could allow the algorithm to maximize the objective function as encoded without strictly adhering to the intended semantics, and could thereby compromise prediction accuracy. For example, failure to properly separate the inductive scores of the three coding phases until the end of a CDS in a Markov chain promotes greedy behavior, because the phase is not selected in a globally optimal manner (Wu, 1996; Salzberg, 1998).

Validation of the final software implementation is important as well, but can be very difficult when the expected predictions of a given DP formulation are not precisely known and would be impractical to produce manually. In these cases, subtle programming errors can go undetected unless one undertakes the time-consuming exercise of rigorously tracing through a number of sample computations to verify correct operation (Geigerich, 2000).

Another complication that not uncommonly arises in DP implementations applied to long sequences is the occurrence of numerical underflow, in which a value in the computation becomes so small that it cannot be represented by the host machine, thereby causing a floating-point exception or simply rendering gene models indistinguishable by assigning them all a score of zero. This is especially common in probabilistic formulations involving the multiplication of large numbers of probabilities. Several methods for avoiding underflow have been devised, such as arbitrarily rescaling the numbers or applying the log transformation, but some of these methods, such as rescaling, can significantly complicate the implementation and validation of the program, and generally must be derived anew for each gene finder (Durbin et al., 1998).

In an attempt to counter some of these obstacles, several recent works have attempted to provide tools for the automatic construction of DP algorithms and software, though whether these techniques will see widespread adoption and tangibly impact the field remains to be seen (Geigerich, 2000; Birney and Durbin, 1997).

4. Future challenges

Gene-finding researchers now face an array of challenges for which new DP algorithms need to be developed. As more genomic information becomes available, the successful migration of gene-finding techniques into the realm of comparative genomics becomes increasingly important. The use of models such as Pair HMMs (see Article 17, Pair hidden Markov models, Volume 7) for gene finding entails a potentially significant increase in computational complexity that must be addressed. Recent attempts at using these models have drawn on techniques such as the use of sparse matrices, heuristic optimization, and approximate alignments (themselves relying on DP methods), and generally have involved one or more simplifying assumptions, such as strict conservation of intron/exon structure (Ureta-Vidal et al., 2003).

Much current work now focuses on gene prediction in two species simultaneously; generalizing this to three or more species will require yet additional innovation to handle the enormous computational load. More generally, one might imagine more flexible DP frameworks in which suboptimal parses are made available in an efficient manner for perusal “on demand” by human annotators or for automatic integration with external evidence, perhaps in the context of a tightly coupled ensemble of separate gene-finding programs.

One possible avenue for future progress is in the parallelization of DP algorithms to enable their deployment on multiprocessing hardware. This in itself represents a considerable challenge, as many DP algorithms are inherently difficult to parallelize, due to pervasive dependencies in the recurrence structure.

Next post:

Previous post: