Finding genes in genomic DNA sequences is a fundamental step in genome analysis. Spliced alignment is an effective method for finding multiexon genes for which a similar cDNA or protein sequence is available. A cDNA sequence is either of full length or of partial length such as an expressed sequence tag (EST). The coordinates of exons that are similar to a cDNA or protein sequence are located on the basis of sequence similarity and splice site strength. This approach is very accurate if exons are highly similar to a cDNA or protein sequence.
In this chapter, we first present a mathematical formulation of the spliced alignment problem such that a mathematically optimal solution is likely to be exactly correct in exon coordinates. Both sequence similarity and splice site strength are included in the formulation. Then, we describe a dynamic programming algorithm for computing an optimal spliced alignment of a genomic sequence and a cDNA sequence. We briefly discuss spliced alignment methods that explore protein sequence similarity. Next, we present a fast method for dealing with a large set of genomic sequences and a set of cDNA sequences. We also describe methods used in existing spliced alignment programs and comment on their strengths. Finally, we offer suggestions for future improvements in spliced alignment.
2. An alignment model
A similarity relationship between a genomic sequence A and a cDNA sequence B is represented by an alignment of the two sequences, an ordered list of pairs of letters of A and B. The alignment consists of substitutions, deletion gaps, and insertion gaps. A substitution pairs a letter of A with a letter of B .A substitution is a match if the two letters are identical and a mismatch otherwise. A deletion gap is a gap in which letters of A correspond to no letter of B, and an insertion gap is a gap in which letters of B correspond to no letter of A. The length of a gap is the number of letters involved. Deletion and insertion gaps are defined with regard to transformation of sequence A into sequence B. An alignment of A and B shows a way to transform A into B, where a letter of A is replaced by a letter of B in every substitution, the letters of A in every deletion gap are deleted, and the letters of B in every insertion gap are inserted.
Figure 1 An alignment of a genomic DNA sequence and a cDNA sequence. A match is indicated by a vertical bar, a mismatch is indicated by a blank, each intron deletion gap is indicated by a number of plus symbols, and other gaps are indicated by a number of dash symbols. This alignment contains 24 matches, 1 mismatch, an exon deletion gap of length 2, an insertion gap of length 1, and three intron deletion gaps of lengths 7, 16, and 11
There are two types of deletions gaps: exon deletion gaps and intron deletion gaps. Exon deletion gaps are introduced to represent minor differences between exon regions of A and regions of B, whereas intron deletion gaps are intended to represent major differences involving introns and other regions of A that do not correspond to any part of B .An alignment of a genomic DNA sequence and a cDNA sequence is shown in Figure 1.
The similarity of an alignment is measured by a numerical score. Let o(a,b) be the score of a substitution involving letters a and b. Let numbers q and r be gap-open and gap-extension penalties respectively. The numbers q and r are nonnegative. The score of an insertion or exon deletion gap of length h is —(q + h x r). Let k be a minimum cutoff on the lengths of intron deletion gaps. Let d = q + k x r be the penalty for a gap of length k. Let do(i) denote the prediction score of a candidate donor site starting at position i of A, where position i — 1 belongs to a candidate exon region and position i belongs to a candidate intron region. Let ac(i) denote the prediction score of a candidate acceptor site ending at position i of A, where position i belongs to a candidate intron region and position i + 1 belongs to a candidate exon region. The score of an intron deletion gap involving a region from positions is to ie of A is do(is) — d + ac(ie). The similarity score of an alignment is just the sum of scores of each substitution and each gap in the alignment.
Values for the parameters o, k, q, and r are specified by the user. A letter-independent substitution table is usually used for comparison of DNA and cDNA sequences. For example, each match is given a score of 10 and each mismatch a score of —20. Possible values for q and r are 40 and 2 respectively. For candidate donor sites with the GT dinucleotide and candidate acceptor sites with the AG dinucleotide, which are called canonical splice sites, their prediction scores are obtained by using a statistical method (Zhang and Marr, 1993; Salzberg, 1997; Brendel and Kleffe, 1998) and scaling resulting scores to a range from —c x r to c x r, where c is a positive integer constant. For the remaining candidate donor and acceptor sites, their prediction scores are —c x r. For example, a value of 5 is used for both c and k, each canonical candidate site is given a score of 10, and each noncanonical candidate site is given a score of —10. The score of the alignment in Figure 1 is —36 under the parameter values given above.
An optimal spliced alignment of two sequences A and B is an alignment of A and B with the maximum score. In the optimal alignment, exon regions of A are likely to be aligned with regions of B, and long introns and intergenic regions of A are likely to be treated as intron deletion gaps because intron deletions gaps are given a constant penalty. Exon-intron boundaries in A are likely to be exactly shown on the optimal alignment because the prediction scores of donor and acceptor sites are used in the definition of the optimal alignment.
3. A dynamic programming algorithm
Let A = a1 a2 … am be a genomic DNA sequence of length m and B = b1 b2 … bn be a cDNA sequence of length n. A dynamic programming algorithm is developed to compute an optimal spliced alignment of A and B. Let Ai = a i a 2 … ai and Bj = bi b2 … bj be the initial segments of lengths i and j of A and B. In the algorithm, a matrix S is introduced: S (i,j) is the maximum score of all alignments of Ai and Bj. Thus, S(m,n) is the score of an optimal spliced alignment of A and B. To compute the matrix S efficiently for i > 0 and j > 0, partition the set of all alignments of Ai and Bj into four groups. Group 1 consists of alignments of Ai and Bj ending with a substitution. Group 2 consists of alignments of Ai and Bj ending with an exon deletion gap. Group 3 consists of alignments of Ai and Bj ending with an insertion gap. Group 4 consists of alignments of Ai and Bj ending with an intron deletion gap.
Three additional matrices ED, ID, and I are introduced for groups 2 through 4. Define ED(i,j) (ED for exon deletion) to be the maximum score of alignments in group 2. Define I(i,j) (I for insertion) to be the maximum score of alignments in group 3. Define ID (i,j) + ac (i) (ID for intron deletion) to be the maximum score of alignments in group 4. Note that the score of each alignment in group 4 includes the term ac (i) as part of the score of an intron deletion gap ending at position i. A partial score of an alignment in group 4 is the score of the alignment minus ac(i). Thus, ID(i,j) is the maximum partial score of alignments in group 4. The introduction of the partial score leads to a simple formula for computing ID (ij).
On the basis of the definitions of the matrices given above, the recurrences for computing the matrices S, ED, I, and ID can be developed. The correctness proof for the recurrences is omitted. The complete recurrences for the matrices are given below.
The matrices can be computed in order of rows or columns. The value S(m,n) is the score of an optimal spliced alignment of A and B. If only the score S (m,n) is needed, then three linear arrays and a few scalars are sufficient to carry out the computation. This algorithm is an extension to standard global alignment algorithms (Needleman and Wunsch, 1970; Sellers, 1974; Wagner and Fischer, 1974; Waterman etal., 1976; Gotoh, 1982).
An optimal alignment is found by a traceback procedure through the matrices S, ED, I, and ID. The aligned pairs of an optimal alignment are generated in a reverse order with the last aligned pair produced first. The traceback procedure is based on the following observations. Let R(Am, Bn) denote an optimal alignment of A and B. If S(m, n) = S(m — 1,n — 1) + o(am, bn), then the last aligned pair of R(Am, Bn) is a substitution pair. Otherwise, if S(m, n) = ED(m, n), then thelast aligned pair is an exon deletion pair. Otherwise, if S(m,n) = I(m,n), then the last aligned pair is an insertion pair. Otherwise, we have S(m, n) = ID (m, n), and the last aligned pair is an intron deletion pair. Next, the procedure is repeated to produce the aligned pairs of the portion of R(Am, Bn) before the last pair.
Specifically, the traceback procedure consists of four cases, one for each matrix. Initially, the current entry is (m,n) in S. First, consider the case in which the current entry is (ij) in S .If i = 0 and j = 0, then the traceback procedure terminates. Otherwise, if i > 0 and S(i, j) = ED (i, j), then the current entry becomes (ij) in ED. Otherwise, if j > 0 and S(i, j) = I(i, j), then the current entry becomes (ij) in I. Otherwise, if i > 0 and S(i, j) = ID (i, j), then the current entry becomes (ij) in ID. Otherwise, the current entry becomes (i — 1, j — 1) in S and a substitution pair (ai, bj) is reported.
Then, consider the case in which the current entry is (i,j) in ED. An exon deletion pair (ai, —) is reported. If i > 1 and ED (i, j) = ED (i — 1, j) — r, then the current entry becomes (i — lj) in ED. Otherwise, the current entry becomes (i — lj) in S.
Next, consider the case in which the current entry is (i,j) in I .An insertion pair (—, bj) is reported. If j > 1 and I(i, j) = I(i, j — 1) — r, then the current entry becomes (ij — 1) in I. Otherwise, the current entry becomes (ij — 1) in S.
Finally, consider the case in which the current entry is (i,j) in ID. An intron deletion pair (ai, —) is reported. If i > 1 and ID(i, j) = ID(i — 1, j), then the current entry becomes (i — 1j) in ID. Otherwise, the current entry becomes (i — 1,j) in S.
The traceback procedure requires that the complete matrices be saved or additional information be saved to indicate how the value at each matrix entry is generated, which takes computer memory proportional to the product m x n. For example, on a genomic sequence of 100 000 bp and a cDNA sequence of 1000 bp, the algorithm requires about 100-400 MB of memory. The time requirement of the algorithm is also proportional to the product m x n. For example, the algorithm requires less than 1 min on an ordinary computer to produce an optimal alignment of a genomic sequence of 100 000 bp and a cDNA sequence of 1000 bp. The space requirement of this algorithm is much more limiting than the time requirement.
An optimal alignment of a genomic sequence and a cDNA sequence can be computed in computer memory proportional to the sum of the sequence lengths by a divide-conquer technique (Hirschberg, 1975; Myers and Miller, 1988). The main idea of this technique is to determine a middle pair of positions on an optimal alignment in linear space. Then, the portions of the optimal alignment before and after the middle pair of positions are constructed recursively. This space-efficient technique is used in several programs for computing spliced alignments (Mott, 1997; Huang et al., 1997; Usuka et al., 2000).
We would like to point out that a dynamic programming algorithm could be developed for a version of the spliced alignment problem based on protein sequence similarity. The algorithm is able to produce, in presence of frameshifts, a strongest similarity relationship between codons of a genomic sequence and amino acids of a protein sequence. There are a few existing spliced alignment programs based on protein sequence similarity (Gelfand et al., 1996; Huang and Zhang, 1996; Usuka and Brendel, 2000).
4. Fast comparison of genomic and cDNA sequences
A general strategy for comparing a large set of genomic sequences with a set of cDNA sequences proceeds in two steps. In step 1, short word matches between genomic and cDNA sequences are quickly located through a lookup table (Pearson and Lipman, 1988; Altschul et al., 1990). Each word match is extended into a high-scoring segment pair (HSP), an ungapped alignment. Then, HSPs from the same pair of genomic and cDNA sequences are combined into chains (Wilbur and Lipman, 1983; Chao and Miller, 1995; Huang, 1996), where a chain of HSPs is an approximation to an alignment, with each HSP being an ungapped portion of the alignment. In step 2, for each high-scoring chain between a region of a genomic sequence and a cDNA sequence, a small area of the dynamic programming matrix is located to cover all the HSPs in the chain. Then, a largest-scoring alignment of the genomic region and the cDNA sequence is produced by restricting the dynamic programming computation to the area.
Variations of this general strategy are used in the following programs: est2genome (Mott, 1997), DDS/GAP2 (Huang etal., 1997), sim4 (Florea etal., 1998), GeneSeqer (Usuka et al., 2000), Spidey (Wheelan et al., 2001), and BLAT (Kent, 2002).
The est2genome program works in three steps. In step 1, a genomic sequence is compared with a set of cDNA sequences with BLAST (Altschul et al., 1990). For each cDNA sequence similar to the genomic sequence, the start and end positions of an optimal local alignment are computed by the Smith-Waterman algorithm (Smith and Waterman, 1981). Then, the corresponding regions of the genomic and cDNA sequences are extracted and an optimal alignment of the two regions is computed by a dynamic programming algorithm in linear space. The GT-AG dinucleotides are used to locate splice sites. The est2genome program is good at computing cross-species alignments.
The DDS program identifies regions of the genomic sequence that are similar to a cDNA sequence in the database, on the basis of step 1 of the general strategy described above. The GAP2 program computes an alignment for each pair of genomic region and cDNA sequence by using a linear-space dynamic programming algorithm based on the GT-AG dinucleotides. The DDS/GAP2 program is suitable for computing cross-species alignments.
The sim4 program works in four steps. In step 1, perfect word matches of length 12 between genomic and cDNA sequences are extended into HSPs. In step 2, HSPs are combined into chains. Each exon core is represented by a set of close HSPs. In step 3, the boundaries of exon regions in the genomic and cDNA sequences are determined by a speedy method based on sequence similarity and the GT-AG dinucleotides. In step 4, for each pair of exon regions of the genomic and cDNA sequences, an alignment of the exon regions is produced by a fast method (Chao et al., 1997). A main strength of sim4 is its ability to perform cross-species comparisons at a great speed.
In the GeneSeqer program, pairs of similar genomic regions and cDNA sequences are found by extending word matches of length 12 into HSPs and combining HSPs into chains. Then, for each pair of similar genomic region and cDNA sequence, an optimal alignment is computed by scoring for both sequence similarity and potential splice site strength. The strengths of splice sites are scored by using a statistical method (Brendel and Kleffe, 1998) instead of the GT-AG dinucleotides. A main strength of GeneSeqer is its ability to find short exons on the basis of their splice site strength.
The Spidey program finds exon-intron boundaries in a genomic sequence by repeatedly performing fast searches with BLAST. Initially, HSPs between the genomic sequence and cDNA sequences are computed by BLAST at a high stringency (e = 10—6). Then, regions of the genomic sequence are located such that all HSPs in a region are nonoverlapping and linearly consistent. Next, for each genomic region, HSPs between the genomic region and the cDNA sequence are computed by BLAST at a lower stringency (e = 10—3). For each gap between HSPs, additional HSPs are computed by BLAST at a low stringency (e = 1). Close HSPs are merged into alignments such that each alignment corresponds to one exon. Finally, potential splice sites around the ends of the alignments are scored and selected. A main strength of Spidey is that its performance is not affected by wide variations in intron size.
In the BLAT program, a lookup table is constructed for all nonoverlapping words of length k in the set of genomic sequences, where k is an integer between 8 and 16. Then, the set of cDNA sequences is scanned to locate perfect or near perfect word matches between genomic and cDNA sequences. If there are a sufficient number of word matches between a genomic sequence and a cDNA sequence, then those word matches are extended into HSPs. Close HSPs are linked together into an alignment by using a fast method based on sequence similarity and the GT-AG dinucleotides. BLAT is also able to speed up the comparison by using many processors. A main strength of BLAT is its ability to handle huge sets of genomic and cDNA sequences from the same species.
Four existing programs, DDS/GAP2, est2genome, sim4, and GeneSeqer, were used and compared by Haas et al. (2002) in a rigorous reannotation effort to map 5016 full-length cDNA sequences from Arabidopsis to its genome. The four programs produced identical splice sites for 4918 of the 5016 cDNA sequences; the programs disagreed on less than 2% of the cDNA sequences. However, the programs are different in their strengths. DDS/GAP2 is better at matching the entire cDNA to the genome than any of the other programs. GeneSeqer is excellent at finding short exons of 3 -25 bp. SIM4 is much faster than any of the other programs.
5. Future developments
We suggest three issues for future studies: identification of short exons, construction of a weight array matrix for noncanonical splice sites, and comparison of cross-species sequences. Short exons are difficult to find because short exon matches are much less significant than long exon matches in assessment of exons. This issue may be addressed through development of a special method to put more weight on splice site strength. As more noncanonical splice sites are known, it may be possible to construct a weight array matrix for noncanonical splice sites. If the matrix is useful for finding noncanonical splice sites, then the matrix can be incorporated in existing alignment algorithms. The full matrix versions of dynamic programming algorithms are sensitive in computing cross-species alignments but are very slow for large-scale applications. It may be possible to improve the efficiency of those algorithms without a significant loss in sensitivity.