Multiple Sequence Alignment in Parallel on a Cluster ofWorkstations - Parallel Computing for Bioinformatics and Computational Biology

Biomedical Engineering Reference

In-Depth Information

A pairwise alignment is performed using the dynamic programming algorithm of

Needleman and Wunsch [11], as modified by Gotoh [3] to obtain O(m 2 ) performance,

where m is the length of the sequences being aligned. For aligning two sequences S

and T , the process involves filling in a matrix that represents all possible alignments

of the two sequences. If S i is the i th amino acid of protein S and T j is the j th amino

acid of protein T , then the cell A(i , j) represents a particular alignment of the two

sequences, where S i is aligned with T j . The matrix is also used to penalize gaps in the

alignment, to encourage fewer and shorter gaps. In general, there are many different

ways of assigning scores for gap , match , and mismatch ; see, for example, the article

by Thompson et al. [15] for a detailed discussion of score assignments in CLUSTAL

W. In the example given subsequently, we assume that a gap is assigned a score of

−

1). The aim is to

find an alignment that has a maximum score. Each step in the dynamic programming

algorithm can be summarized as follows:

2 and a match (or mismatch) is assigned a score of

1 (or

−

Each entry A(i , j) can be filled in three ways [ A(i

−

1, j

−

1 ) , A(i

−

1, j) , and

•

A(i , j

1 ) are the three elements immediately preceding A(i , j) in the matrix]

1. The last nucleotide of S(i) is aligned with T( j) . There are two pos-

sibilities, either they match (

−

1) or they mismatch (

−

1). The cost is

2. S(i) is aligned with a gap in T . The cost is A(i

A(i

−

1, j

−

1 )

−

1, j)

−

3. T( j) is aligned with a gap in S . The cost is A(i , j

−

1 )

−

In each case, the rest of prefix of S should be optimally aligned with the rest of

prefix of T .

•

We can write these three possibilities in a more mathematical form as follows:

A(i

−

1, j)

−

align S(i) with a gap

A(i , j)

max

A(i , j

−

1 )

−

align T( j) with a gap

A(i

−

1, j

−

1 )

align S(i) with T(j)

Hence, we need the three entries A(i

1 ) for filing

the entry A(i , j) . Note that all these three entries have been already computed when

we are trying to compute A(i , j) .

The time to fill the entire matrix is O(m

−

1, j) , A(i , j

−

1 ) , and A(i

−

1, j

−

n) when the two sequences are of

length m and n , respectively. The cost of the optimal alignment is the cost in the

A(m , n) th entry of the dynamic programming matrix. At every step, we can keep back

pointers to indicate where the optimal cost came from. Hence, we can determine the

optimal alignment by following these back pointers when the dynamic programming

is completed.

The value of the A(m , n) th entry in this matrix is the distance or score for the optimal

alignment of the two sequences; the higher the score, the closer the relationship

between the two sequences. This is because a higher score can be achieved only if a

large number of residues or bases in the two strings match. Recall that only matches

Parallel Computing for Bioinformatics and Computational Biology

Search WWH ::

Custom Search

Home