Information Technology Reference
In-Depth Information
of the underlying similarity groups. Apart from narrative descriptions there are two general
avenues to describe similarity groups. Cladograms are classifications that can be
established using proximity measures and represent the internal structure of the similarity
group. Common patterns on the other hand are usually derived from alignments and
represent common substructures present in the members of the similarity group.
The above steps are not always obvious for the users. For example, sequence
similarity search programs present the results corresponding to step II, while some of the 3-
D similarity search servers provide only a qualitative suggestion corresponding to step I.
What is apparent however that all methods include a preliminary, approximate estimation
of similarity, followed by a filtering and finally an alignment step.
This section provides a brief overview of how similarity scoring in used in the
comparison of sequences, protein 3-D structures and entire genomes. In these fields,
similarity measures are used for database searching, for classification and for phylogenetic
analysis. A comprehensive overview of these broad fields would be far beyond the scope of
this chapter. Instead, we will attempt to highlight, using the terminology introduced in the
previous sections, the common themes underlying these three diverse areas.
1. Sequence comparison
Sequences are the simplest descriptions of macromolecules that use residues (amino acids,
nucleotides) as entities and sequential vicinity as the only relationship between them.
Sequence comparison algorithms use essentially the same principle for similarity scoring.
The simple proximity measure is related to the Hamming distance (i.e. no gaps allowed, as
shown in Fig. 3.2). The scoring matrices used in DNA as well as protein comparisons are
constructed in such a way that similar residues give high scores, so the resulting measure
can be called a Hamming similarity measure, rather than a distance. The optimizable
substructure similarity is the string similarity measure (equation 5) in which the position
and number of the gaps as well as the range of alignment is determined by optimization.
The result is a maximal matching, and the alignment score is a local or global maximum
value depending on the algorithm used. Algorithms of global alignment (Needleman-
Wunsch, [1]) or local alignment (FASTA [2], BLAST [3],) have been the subject of several
excellent, recent reviews [see, e.g. [4,5]], the current section focuses on the principles of
scoring, i.e. how a similarity score is transformed into a probabilistic measure.
We will use a simple classification: General methods of comparison use a general
statistical description of random similarities for calculating the significance value to
alignment scores. Specific methods use application-specific descriptions of the biologically
important target groups, such as protein families, domain sequence groups etc. These
groups are often too small for statistics, so specific methods rely instead on additional, a
priori knowledge.
The most frequently used general methods (BLAST [3], FASTA [2], Smith-
Waterman [6]) are based on local sequence alignment. The resulting sequence similarity
scores do not preserve the metric properties (can not be converted into metric distances), on
the other hand they have the advantage that the distribution of random similarities can be
described in an analytical form. This is because scores are maximal values, and the
maximum of a large number of independent identically distributed (i.i.d) random variables
tends to an extreme value (or Gumbell) distribution, just as the sum of a large number of
i.i.d. random variables tends to a normal distribution [7]. The underlying statistics was
described in detail by Karlin and Altschul [8,9] for the BLAST program. Originally, BLAST
used local alignments without gaps called high-scoring segment pair (or HSP ), in which
scores were maximized in the sense that they could not be further improved by extension or
Search WWH ::




Custom Search