Heuristic Alignment Methods - Multiple Sequence Alignment Methods

Biology Reference

In-Depth Information

alignment under the sum-of-pairs scoring system. Gotoh [ 61 ] also

succeeded in acceleration of the algorithm when N is large based on

the “generalized profile” data structure, where “generalized”means

that various lengths of internal gaps, as well as ordinary residues,

are treated in the form of profile. The difficult point is to compute

G ( p , q , l ) exactly. In the worst case, this problem is proven to be

NP-complete [ 62 ], but in practice not much more than O ( NL 2 )

operations are sufficient to reach the optimality [ 63 , 64 ]. However,

this type of exact method is still considerably slower than approxi-

mate methods that use only the ordinary profile vectors. Except for

PRRN/PRRP [ 52 , 63 ] and OPAL [ 43 ], currently popular MSA

programs, such as ClustalW [ 36 ], MAFFT [ 40 , 65 ] and MUSCLE

[ 53 ], adopt a fast approximate method to evaluate gap open penal-

ties. An intermediate approach that calculates G ( p , q , l ) exactly but

omits the branch-and-bound extension is employed in PRIME

[ 66 ] to introduce double affine gap penalties in MSA. T-Coffee

[ 24 ], ProbCons [ 19 ], and their derivatives are not bothered with

the procedure to evaluate gap penalties in MSA, as gaps are not

penalized at the progressive and refinement stages.

Note that many column scores other than the sum-of-pairs

column score (the first term in the square brackets in Eq. 4 ) are

compatible with progressive alignment and iterative refinement.

In fact, MUSCLE [ 41 ] and HHsearch [ 67 ] (a component of

Clustal

) use different implementations of column-specific log

odds score. Two studies [ 68 , 69 ] on relative performance of various

column scores indicated that most of the scores they examined

behave similarly to one another in alignment accuracy. However,

further studies might be necessary on this point [ 70 ]. Meanwhile,

Edgar [ 71 ] recently noticed that VTML amino acid substitution

matrix [ 72 ] works best for accurate MSA construction by MUS-

CLE under the optimized gap penalties.

At the step (2), the existingMSA is horizontally divided into two

groups. As there are 2 N 1

Ω

1 possible ways for division, trials of all

of them [ 57 , 58 ] are time consuming. Hirosawa et al. [ 73 ] examined

variousmethods of partitioning, and reached the conclusion that tree-

restricted partitioning (partitioning at each edge of the guide tree) is

best in balance between accuracy and computational costs. Gotoh

[ 74 ] also reached the same partitioning scheme fromanother point of

view, i.e., this partitioning ideally fit calculation of WSP with the pair

weights w p , q obtained by the three-way method, an approximation of

the Rational II model of Altschul [ 60 ]. Weighting sequences or

sequence pairs will be discussed in the last subsection in more detail.

The tree-restricted partitioning is the default iterative refinement

strategy of PRRN [ 52 ], MAFFT [ 65 ] and MUSCLE [ 53 ].

Fast homology search programs such as BLAST [ 75 ] and BLAT

[ 76 ] generally adopts seed and extension strategies to find high-

scoring segment pairs (HSPs). The concept of HSP is easily

2.5 Anchoring

Multiple Sequence Alignment Methods

Search WWH ::

Custom Search

Home