compared with knowledge-based groups, such as protein families. With this approach,
protein domains that are shared among several protein families lead to the merging of
protein family clusters.
A sharp distinction between biologically significant and random similarities is not
possible from the scores alone - such decisions still require a priori knowledge, namely
biological knowledge (e.g. knowledge of the overall domain structure of the protein, the
exon structure of the genes) as well as knowledge of the previously known similar
sequences. In addition to the general methods of sequence comparison mentioned above,
there are a number of dedicated specific methods, based on some explicit representation of
biologically important similarity groups such as protein domain sequences. A sequence
similarity group can be represented by a consensus description that represents e.g. a
sequence pattern that is shared by all members of the group. As such patterns can be
obtained by multiple sequence alignments, there is a large variety of algorithms that
represent multiple alignments in terms of consensus sequences, regular expressions,
position-specific scoring matrices or profiles, hidden Markov models (HMMs) or neural
networks (for recent reviews see [5,12]). These consensus descriptions can then be used to
decide whether or not a new query sequence is a member of a given similarity group. The
similarity measures used to compare a query with these representations are similar to the
ones described in this review; the details can be found in the original publications as well as
in the reviews cited above.
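
As an illustration of one such consensus description, the following is a minimal sketch of a position-specific scoring matrix built from a gapless multiple alignment and used to decide group membership. The function names, the uniform background frequency, the pseudocount and the threshold are all assumptions made for this example; they are not taken from the methods reviewed in [5,12].

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds scoring matrix from equal-length, gapless sequences.

    Assumption: a uniform background frequency (1/20) for every residue.
    """
    background = 1.0 / len(ALPHABET)
    total = len(alignment) + pseudocount * len(ALPHABET)
    pssm = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        }
        pssm.append(column)
    return pssm


def best_window_score(pssm, query):
    """Slide the matrix along the query and return the best summed score.

    Assumption: the query contains only the 20 standard residues.
    """
    width = len(pssm)
    if len(query) < width:
        return float("-inf")
    return max(
        sum(pssm[i][query[start + i]] for i in range(width))
        for start in range(len(query) - width + 1)
    )


def is_member(pssm, query, threshold):
    """Accept the query if its best window reaches the (hypothetical) threshold."""
    return best_window_score(pssm, query) >= threshold
```

A query containing the conserved pattern scores well above one that lacks it; how the threshold is chosen in practice depends on the method and is not specified here.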
Another group of specific approaches uses a graph-theoretical representation of
similarity groups, which is an exemplar-based description. Sequences within a similarity
group are related to each other by specific similarity (Figure 3.1); for example, each
member of the group is related to at least one other member with a similarity score greater
than a certain threshold [13]. Protein domains are typical examples of well-defined
similarity groups. On the other hand, many of the known proteins are composed of
modules, so the score determined between two such proteins will express the similarity of
the building blocks, rather than that of the two proteins.
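
The threshold rule described above (each member is linked to at least one other member by a score above a threshold) can be sketched as finding connected components in a similarity graph. This is only an illustrative sketch: the function name, the input format (a dictionary of pairwise scores) and the threshold value are assumptions, not the procedure of [13].

```python
def similarity_groups(pair_scores, threshold):
    """Group sequences linked by a chain of scores greater than `threshold`.

    pair_scores: dict mapping (seq_a, seq_b) tuples to a similarity score
    (hypothetical input format chosen for this example).
    """
    neighbours = {}
    for (a, b), score in pair_scores.items():
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        # Depth-first traversal collects everything reachable from `start`.
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

Note that a single high-scoring link is enough to join two groups, so a domain shared by two otherwise unrelated multidomain proteins would merge their groups under this rule.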
The similarities of protein domain groups can be defined on a threshold basis. In the
SBASE protein domain sequence library, a sequence is considered a member of a domain
group if it is similar to at least NSD_t members of the group, with an average similarity score
of AVS_t, where NSD_t and AVS_t are threshold values automatically determined from a
database vs. database comparison with the BLAST program. A later extension of this
scoring system takes into consideration the distribution of similarity scores in the
neighborhood of each similarity group and uses a probabilistic score. For each raw score,
four probability values are read from the precomputed distributions shown in Figure 1, and
the score is derived from the sum of these distributions [14]. From the computational point
of view, this approach is similar to the memory-based computing paradigm [15]; the
memory of the system is a database vs. database comparison [16,17].
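
The simple two-threshold membership rule can be sketched as follows. This is a hedged illustration, not the SBASE implementation: the function name and the input (a list of similarity scores between the query and the group members it hits) are hypothetical, and reading the average-score condition as "at least AVS_t" is an assumption. The later probabilistic extension [14] is not modelled here.

```python
from statistics import mean


def is_domain_member(hit_scores, nsd_t, avs_t):
    """Apply the two-threshold rule to one query and one domain group.

    hit_scores: similarity scores of the query against the group members
                it is similar to (hypothetical input format).
    nsd_t:      minimum number of similar group members.
    avs_t:      minimum average similarity score (assumed to be a lower bound).
    """
    if not hit_scores or len(hit_scores) < nsd_t:
        return False
    return mean(hit_scores) >= avs_t
```

For example, with nsd_t = 3 and avs_t = 100, a query with hit scores 120, 95 and 110 would be accepted, whereas a query with only two hits, or with three hits averaging 70, would be rejected.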
The approach underlying the COG (Clusters of Orthologous Groups) databank is
based on grouping sequences together that are mutually the nearest neighbours of each
other in terms of sequence similarity score [18]. Such tight groups or cliques can be
extended to larger similarity groups, which is the basis of identifying orthologous proteins.
This approach is especially successful in prokaryotic genomes in which multidomain
proteins are not abundant.
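
The idea of mutual nearest neighbours can be sketched as a search for reciprocal best hits. This is a simplified illustration under stated assumptions: the function name and the directed score dictionary are hypothetical, ties are resolved by keeping the first hit encountered, and the extension of such pairs into larger groups, as done for COG [18], is not shown.

```python
def mutual_nearest_neighbours(scores):
    """Return pairs of sequences that are each other's best-scoring hit.

    scores: dict mapping (query, subject) tuples to a similarity score;
            both directions are expected to be present (hypothetical format).
    """
    best = {}
    for (query, subject), score in scores.items():
        if query == subject:
            continue  # ignore self-hits
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)

    pairs = set()
    for query, (subject, _) in best.items():
        # Keep the pair only if the relation holds in both directions.
        if subject in best and best[subject][0] == query:
            pairs.add(frozenset((query, subject)))
    return pairs
```

In this sketch a sequence whose best hit prefers some other partner is left unpaired, which is what keeps the resulting groups tight.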
Recent approaches combine many of the previous concepts. The underlying
philosophy is that database search results should contain all information necessary to find
distant similarities - such as the weak similarities of protein domains - and that these might
be found via a clever sorting of the search results. Namely, the alignment scores (and the P
values) traditionally used to sort the results constitute only one dimension of the sorting.