Comparison of sequences, protein 3D structures and genomes - Essays in Bioinformatics

compared with knowledge-based groups, such as protein families. With this approach,

protein domains that are shared among several protein families lead to the merging of

protein family clusters.

A sharp distinction between biologically significant and random similarities is not

possible from the scores alone - such decisions still require a priori knowledge, namely

biological knowledge (e.g. knowledge of the overall domain structure of the protein, the

exon-structure of the genes) as well as knowledge of the previously known similar

sequences. In addition to the general methods of sequence comparison mentioned above,

there are a number of dedicated specific methods, based on some explicit representation of

biologically important similarity groups such as protein domain sequences. A sequence

similarity group can be represented by a consensus description that represents e.g. a

sequence pattern that is shared by all members of the group. As such patterns can be

obtained by multiple sequence alignments, there is a large variety of algorithms that

represent multiple alignments in terms of consensus sequences, regular expressions,

position-specific scoring matrices or profiles, hidden Markov models (HMMs) or neural

networks (for recent reviews see [5,12]). These consensus descriptions can then be used to

decide whether or not a new query sequence is a member of a given similarity group. The

similarity measures used to compare a query with these representations are similar to the

ones described in this review; the details can be found in the original publications as well as

the reviews cited above.
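
To make the idea of a consensus description concrete, the following is a minimal sketch of one of the representations listed above, a position-specific scoring matrix, built from a gap-free multiple alignment and used to test a query for group membership. The function names, the uniform background frequency, the pseudocount and the threshold are all illustrative assumptions and are not taken from any of the cited methods.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_pssm(alignment, pseudocount=1.0):
    """Build a log-odds matrix from gap-free aligned sequences of equal length."""
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(alignment) + pseudocount * len(AMINO_ACIDS)
    pssm = []
    for pos in range(len(alignment[0])):
        counts = Counter(seq[pos] for seq in alignment)
        column = {
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in AMINO_ACIDS
        }
        pssm.append(column)
    return pssm


def best_window_score(pssm, query):
    """Score every ungapped window of the query and return the best score."""
    width = len(pssm)
    if len(query) < width:
        return float("-inf")
    return max(
        # residues outside the 20-letter alphabet contribute a neutral 0.0
        sum(pssm[i].get(query[start + i], 0.0) for i in range(width))
        for start in range(len(query) - width + 1)
    )


def is_member(pssm, query, threshold):
    """Hypothetical membership rule: best window score reaches the threshold."""
    return best_window_score(pssm, query) >= threshold
```

For example, a matrix built from the toy alignment `["ACDE", "ACDE", "ACDF"]` scores a query containing `ACDE` well above a query made only of `G` residues. In practice the background frequencies and the threshold would be estimated from data rather than fixed as here.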

Another group of specific approaches uses a graph-theoretical representation of

similarity groups, which is an exemplar-based description. Sequences within a similarity

group are related to each other by specific similarity (Figure 3.1.), for example each

member of the group is related to at least one other member with a similarity score greater

than a certain threshold [13]. Protein domains are typical examples of well-defined

similarity groups. On the other hand, many of the known proteins are composed of

modules, so the score determined between two such proteins will express the similarity of

the building blocks, rather than that of the two proteins.
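
The exemplar-based rule described above (each member is related to at least one other member by a score above a threshold) amounts to finding connected components in a similarity graph. The sketch below illustrates this under simple assumptions: the input is a hypothetical dictionary of pairwise scores, and the function name and data layout are chosen for illustration only.

```python
from collections import defaultdict


def similarity_groups(scores, threshold):
    """Group sequences linked by chains of pairwise scores above the threshold.

    scores: dict mapping (id_a, id_b) -> similarity score (illustrative layout).
    Returns a list of sets, one per connected component.
    """
    neighbours = defaultdict(set)
    nodes = set()
    for (a, b), score in scores.items():
        nodes.update((a, b))
        if score > threshold:
            neighbours[a].add(b)
            neighbours[b].add(a)

    groups, seen = [], set()
    for start in sorted(nodes):
        if start in seen:
            continue
        # depth-first traversal collects everything reachable from start
        stack, group = [start], set()
        while stack:
            node = stack.pop()
            if node in group:
                continue
            group.add(node)
            stack.extend(neighbours[node] - group)
        seen |= group
        groups.append(group)
    return groups
```

Because a single link is enough to join two sequences, a multidomain protein that scores above the threshold against members of two different families will merge those families into one group, which is the effect noted earlier for shared protein domains.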

The similarities of protein domain groups can be defined on a threshold basis. In the

SBASE protein domain sequence library, a sequence is considered a member of a domain

group if it is similar to at least NSD_t members of the group, with an average similarity score

of AVS_t, where NSD_t and AVS_t are threshold values automatically determined from a

database vs. database comparison with the BLAST program. A later extension of this

scoring system takes into consideration the distribution of similarity scores in the

neighborhood of each similarity group and uses a probabilistic score. For each raw score,

four probability values are read from the precomputed distributions shown in Figure 1, and

the score is derived from the sum of these distributions [14]. From the computational point

of view, this approach is similar to the memory-based computing paradigm [15]; the

memory of the system is a database vs. database comparison [16,17].
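
The basic two-threshold membership rule given at the start of this paragraph can be sketched as follows. This is only an illustration: it assumes that "similar" means the query produced a reported hit against a group member, and that the average score must be at least the second threshold. The function and parameter names are hypothetical, and the later probabilistic extension is not covered.

```python
def is_domain_member(hit_scores, nsd_t, avs_t):
    """Apply the two-threshold rule to one query and one domain group.

    hit_scores: similarity scores of the query against the group members
                it hit (assumption: every reported hit counts as "similar").
    nsd_t:      minimum number of similar group members.
    avs_t:      minimum average similarity score over those members.
    """
    if not hit_scores or len(hit_scores) < nsd_t:
        return False
    return sum(hit_scores) / len(hit_scores) >= avs_t
```

With `nsd_t=3` and `avs_t=90`, a query with hit scores `[120, 100, 80]` would be accepted, while one with only two hits, or with three weak hits averaging 50, would be rejected.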

The approach underlying the COG (Clusters of Orthologous Groups) databank is

based on grouping sequences together that are mutually the nearest neighbours of each

other in terms of sequence similarity score [18]. Such tight groups or cliques can be

extended to larger similarity groups, which is the basis of identifying orthologous proteins.

This approach is especially successful in prokaryotic genomes in which multidomain

proteins are not abundant.
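
The notion of sequences that are mutually the nearest neighbours of each other can be illustrated with a short sketch of reciprocal best hits. This is a simplification for illustration and not the actual COG procedure: it ignores which genome each sequence comes from, treats scores as symmetric, and uses hypothetical function names and input layout.

```python
def best_hits(scores):
    """Return, for each sequence, its highest-scoring partner.

    scores: dict mapping (id_a, id_b) -> similarity score, treated as symmetric.
    Ties keep the partner that was seen first.
    """
    best = {}
    for (a, b), score in scores.items():
        for query, target in ((a, b), (b, a)):
            if query not in best or score > best[query][1]:
                best[query] = (target, score)
    return {query: target for query, (target, _) in best.items()}


def mutual_nearest_neighbours(scores):
    """Return the pairs whose members are each other's best hit."""
    best = best_hits(scores)
    return {
        frozenset((a, b))
        for a, b in best.items()
        if best.get(b) == a
    }
```

Pairs found this way form the tight cores that can then be extended into larger similarity groups, as described above.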

Recent approaches combine many of the previous concepts. The underlying

philosophy is that database search results should contain all information necessary to find

distant similarities - such as the weak similarities of protein domains - and that these might

be found via a clever sorting of the search results. Namely, the alignment scores (and the P

values) traditionally used to sort the results constitute only one dimension of the sorting.
