Comparison of sequences, protein 3D structures and genomes - Essays in Bioinformatics

Information Technology Reference

In-Depth Information

of the underlying similarity groups. Apart from narrative descriptions there are two general

avenues to describe similarity groups. Cladograms are classifications that can be

established using proximity measures and represent the internal structure of the similarity

group. Common patterns on the other hand are usually derived from alignments and

represent common substructures present in the members of the similarity group.

The above steps are not always obvious for the users. For example, sequence

similarity search programs present the results corresponding to step II, while some of the 3-

D similarity search servers provide only a qualitative suggestion corresponding to step I.

What is apparent however that all methods include a preliminary, approximate estimation

of similarity, followed by a filtering and finally an alignment step.

This section provides a brief overview of how similarity scoring in used in the

comparison of sequences, protein 3-D structures and entire genomes. In these fields,

similarity measures are used for database searching, for classification and for phylogenetic

analysis. A comprehensive overview of these broad fields would be far beyond the scope of

this chapter. Instead, we will attempt to highlight, using the terminology introduced in the

previous sections, the common themes underlying these three diverse areas.

1. Sequence comparison

Sequences are the simplest descriptions of macromolecules that use residues (amino acids,

nucleotides) as entities and sequential vicinity as the only relationship between them.

Sequence comparison algorithms use essentially the same principle for similarity scoring.

The simple proximity measure is related to the Hamming distance (i.e. no gaps allowed, as

shown in Fig. 3.2). The scoring matrices used in DNA as well as protein comparisons are

constructed in such a way that similar residues give high scores, so the resulting measure

can be called a Hamming similarity measure, rather than a distance. The optimizable

substructure similarity is the string similarity measure (equation 5) in which the position

and number of the gaps as well as the range of alignment is determined by optimization.

The result is a maximal matching, and the alignment score is a local or global maximum

value depending on the algorithm used. Algorithms of global alignment (Needleman-

Wunsch, [1]) or local alignment (FASTA [2], BLAST [3],) have been the subject of several

excellent, recent reviews [see, e.g. [4,5]], the current section focuses on the principles of

scoring, i.e. how a similarity score is transformed into a probabilistic measure.

We will use a simple classification: General methods of comparison use a general

statistical description of random similarities for calculating the significance value to

alignment scores. Specific methods use application-specific descriptions of the biologically

important target groups, such as protein families, domain sequence groups etc. These

groups are often too small for statistics, so specific methods rely instead on additional, a

priori knowledge.

The most frequently used general methods (BLAST [3], FASTA [2], Smith-

Waterman [6]) are based on local sequence alignment. The resulting sequence similarity

scores do not preserve the metric properties (can not be converted into metric distances), on

the other hand they have the advantage that the distribution of random similarities can be

described in an analytical form. This is because scores are maximal values, and the

maximum of a large number of independent identically distributed (i.i.d) random variables

tends to an extreme value (or Gumbell) distribution, just as the sum of a large number of

i.i.d. random variables tends to a normal distribution [7]. The underlying statistics was

described in detail by Karlin and Altschul [8,9] for the BLAST program. Originally, BLAST

used local alignments without gaps called high-scoring segment pair (or HSP ), in which

scores were maximized in the sense that they could not be further improved by extension or

Essays in Bioinformatics

Search WWH ::

Custom Search

Home