Pattern Matching - Bioinformatics Computing

Fundamentals

Sequence alignment is fundamental to inferring homology (common ancestry) and function. For example, it's generally accepted that if two sequences are in alignment (that is, part or all of the pattern of nucleotides or amino acids matches), then they are similar and may be homologous. Another heuristic is that if the sequence of a protein or other molecule significantly matches the sequence of a protein with a known structure and function, then the molecules may share structure and function. The issues related to pairwise sequence alignment, global versus local alignment, and multiple sequence alignment are introduced here.

Pairwise Sequence Alignment

Pairwise sequence alignment involves the matching of two sequences, one pair of elements at a time. The challenge is to find the optimal alignment of two sequences that share some degree of similarity. Optimality is typically judged by a score that reflects the number of matching character pairs in the two sequences and the number and length of the gaps required to adjust the sequences so that the maximum number of characters is in alignment. For example, consider the ideal case of two identical nucleotide sequences, (A) and (B):

A) ATTCGGCATTCAGTGCTAGA

B) ATTCGGCATTCAGTGCTAGA

Assuming that the alignment scoring algorithm awards one point per pair of aligned, matching characters, the score is one point for each of the 20 pairs, or 20 points. Now, consider the case when several of the character pairs aren't aligned:

C) ATTCGGCATTCAGTGCTAGA

D) ATTCGGCATTGCTAGA

In this case, the score would be 11, because only 11 pairs of characters in sequences (C) and (D) are aligned. However, by examining the ends of the sequences, it can be seen that the last six characters of the two sequences are identical. By shifting these last six characters of sequence (D) to the right, that is, by inserting four spacers or gaps ahead of them, the sequences become:

E) ATTCGGCATTCAGTGCTAGA

F) ATTCGGCATT----GCTAGA

Now the score, based on the original algorithm of character pairings, is 16. However, because the score would have been 11 without the inserted gaps, a penalty should be exacted for each gap inserted into the sequence, to favor alignments that can be made with as few gaps as possible. Assuming a gap penalty of -0.5 per gap, the alignment score becomes 10 + 6 + (4 x -0.5), or 14.
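The scoring arithmetic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names (count_matches, score_alignment, best_single_gap_run) and the "-" gap character are hypothetical choices made for this example, mismatches are assumed to score zero, and the brute-force search considers only one contiguous run of gaps in the shorter sequence. It is not a general alignment algorithm.

```python
def count_matches(a, b):
    """Count identical characters at the same position (no gaps inserted).

    zip() stops at the end of the shorter sequence, so trailing
    characters of the longer sequence are simply ignored.
    """
    return sum(1 for x, y in zip(a, b) if x == y)


def score_alignment(top, bottom, match=1.0, gap=-0.5, gap_char="-"):
    """Score two already-aligned strings of equal length.

    Assumptions for this sketch: a matching column earns `match`, a
    column containing a gap character earns `gap` (a negative value),
    and a mismatched column earns nothing.
    """
    if len(top) != len(bottom):
        raise ValueError("aligned sequences must have the same length")
    score = 0.0
    for x, y in zip(top, bottom):
        if x == gap_char or y == gap_char:
            score += gap
        elif x == y:
            score += match
    return score


def best_single_gap_run(longer, shorter, match=1.0, gap=-0.5):
    """Try every position for ONE run of gaps in `shorter`.

    Returns the best score and every placement that achieves it.
    """
    n_gaps = len(longer) - len(shorter)
    if n_gaps < 0:
        raise ValueError("first argument must be the longer sequence")
    best_score, best_placements = None, []
    for i in range(len(shorter) + 1):
        candidate = shorter[:i] + "-" * n_gaps + shorter[i:]
        s = score_alignment(longer, candidate, match, gap)
        if best_score is None or s > best_score:
            best_score, best_placements = s, [candidate]
        elif s == best_score:
            best_placements.append(candidate)
    return best_score, best_placements


if __name__ == "__main__":
    seq_c = "ATTCGGCATTCAGTGCTAGA"
    seq_d = "ATTCGGCATTGCTAGA"
    print(count_matches(seq_c, seq_c))                       # 20
    print(count_matches(seq_c, seq_d))                       # 11
    print(score_alignment(seq_c, "ATTCGGCATT----GCTAGA"))    # 14.0
    print(best_single_gap_run(seq_c, seq_d))
```

For sequences (C) and (D), the search reports a best score of 14 and two placements that reach it: alignment (F), and a second alignment in which the run of four gaps starts one position earlier. The best-scoring alignment under a given scheme is therefore not necessarily unique.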

A more likely scenario is one in which the areas of similarity and difference are not obvious. Consider the sequences (G) and (H):
