Comparison of sequences, protein 3D structures and genomes - Essays in Bioinformatics


Mitchel [57] developed a graph-like representation that uses secondary structural elements (SSEs) as nodes and the angles and distances between them as edges. The largest common substructure was then identified with the subgraph isomorphism algorithm developed by Grindley et al. [58]. Harrison and associates [59] further developed this approach and introduced a similarity measure, S_grath, based on the number of SSEs and residues in the two proteins and in their largest common subgraph:

$$ S_{grath} = \frac{\mathrm{Min}(\ldots)}{\mathrm{Max}(\ldots)} \qquad [9] $$

where the two proteins compared have SS1 and SS2 secondary structure elements and R1 and R2 amino acid residues, respectively, and their comparison generates a largest clique of size CS. The largest clique is produced from a set of secondary structures that contain a total of CR1 and CR2 residues in protein 1 and protein 2, respectively. This similarity measure is reported to be independent of fold size and was used to characterize the fold space represented by the CATH database [59].
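The clique-based search for a largest common substructure can be illustrated with a minimal sketch. It is not the algorithm of Grindley et al. [58]; it is a generic correspondence-graph construction followed by Bron-Kerbosch clique enumeration. The function names, the use of a single inter-SSE distance per edge (angles are omitted) and the tolerance `tol` are assumptions made for illustration only.

```python
from itertools import combinations

def correspondence_graph(types1, dist1, types2, dist2, tol=1.0):
    """Pair up SSEs of equal type; connect two pairs when the inter-SSE
    distances in the two proteins agree within `tol` (assumed units)."""
    nodes = [(i, j) for i, t1 in enumerate(types1)
                    for j, t2 in enumerate(types2) if t1 == t2]
    adj = {n: set() for n in nodes}
    for (i, j), (k, l) in combinations(nodes, 2):
        if i != k and j != l and abs(dist1[i][k] - dist2[j][l]) <= tol:
            adj[(i, j)].add((k, l))
            adj[(k, l)].add((i, j))
    return adj

def largest_clique(adj):
    """Bron-Kerbosch with pivoting; returns one maximum clique as a set."""
    best = set()

    def expand(r, p, x):
        nonlocal best
        if not p and not x:
            if len(r) > len(best):
                best = set(r)
            return
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return best

# Toy data (hypothetical): 'H' = helix, 'E' = strand, symmetric distance matrices.
types1 = ["H", "E", "E", "H"]
dist1 = [[0, 10, 14, 20], [10, 0, 5, 12], [14, 5, 0, 9], [20, 12, 9, 0]]
types2 = ["E", "E", "H"]
dist2 = [[0, 5, 12], [5, 0, 9], [12, 9, 0]]

clique = largest_clique(correspondence_graph(types1, dist1, types2, dist2))
print(sorted(clique))  # [(1, 0), (2, 1), (3, 2)]
```

In this toy case the clique size CS is 3: SSEs 1, 2 and 3 of the first protein are matched to SSEs 0, 1 and 2 of the second. Exhaustive clique enumeration is exponential in the worst case, which is acceptable here because proteins contain few SSEs.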

Finally, there are methods that do not use superposition but define simple similarity scores instead. PRIDE [60] is based on the distribution of intramolecular Cα-Cα distances, collected into a set of histograms for Cα pairs separated by 3 to 30 residues. The comparison between two proteins is thus reduced to the comparison of 28 pairs of distributions, which can be carried out by the standard statistical method of contingency table analysis and yields a probability value for each pair. The average of these 28 individual similarity scores was defined as the Probability of identity, or PRIDE score [60]. PRIDE takes a value between 0 and 1 and has metric properties, which make it suitable for clustering large datasets. The calculation is extremely fast (perhaps the fastest available today), so database searches, fold assignment and clustering of structures are possible online. When used as a simple nearest neighbor classifier, PRIDE reaches 99.5% success in fold recognition, based on the C, A, T and H classes of the CATH database. This method is available via a web server at ICGEB.
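The histogram comparison described above can be sketched as follows. This is an illustration of the idea, not the published PRIDE implementation [60]: the bin width, the plain chi-square contingency test and all function names are assumptions, and both chains must be longer than 30 residues so that every histogram is non-empty.

```python
import math
from collections import Counter

def chi2_sf(x, dof):
    """Upper-tail probability of the chi-square distribution, computed from
    the series expansion of the regularized lower incomplete gamma function."""
    if x <= 0:
        return 1.0
    a, z = dof / 2.0, x / 2.0
    term = total = 1.0 / a
    for n in range(1, 10000):
        term *= z / (a + n)
        total += term
        if term < total * 1e-15:
            break
    p_lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - p_lower)

def distance_histogram(ca, sep, bin_width=1.0):
    """Histogram of distances between C-alpha atoms i and i + sep.
    bin_width is an assumed value, not taken from the source."""
    return Counter(int(math.dist(ca[i], ca[i + sep]) // bin_width)
                   for i in range(len(ca) - sep))

def histogram_similarity(h1, h2):
    """P-value of a chi-square test on the 2 x k contingency table."""
    bins = set(h1) | set(h2)
    n1, n2 = sum(h1.values()), sum(h2.values())
    chi2 = 0.0
    for b in bins:
        row = h1[b] + h2[b]
        for observed, col in ((h1[b], n1), (h2[b], n2)):
            expected = row * col / (n1 + n2)
            chi2 += (observed - expected) ** 2 / expected
    dof = len(bins) - 1
    return 1.0 if dof == 0 else chi2_sf(chi2, dof)

def pride_like_score(ca1, ca2, seps=range(3, 31)):
    """Average of the 28 per-separation probabilities (separations 3 to 30)."""
    scores = [histogram_similarity(distance_histogram(ca1, s),
                                   distance_histogram(ca2, s)) for s in seps]
    return sum(scores) / len(scores)

# Idealized toy chains (hypothetical coordinates, 40 residues each).
helix = [(2.3 * math.cos(math.radians(100 * i)),
          2.3 * math.sin(math.radians(100 * i)), 1.5 * i) for i in range(40)]
strand = [(0.0, 0.0, 3.3 * i) for i in range(40)]

same = pride_like_score(helix, helix)    # identical chains give 1.0
diff = pride_like_score(helix, strand)   # very different chains score near 0
```

Because only per-chain histograms are compared, no superposition or alignment is needed, which is consistent with the speed claimed for the method.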

Another recent, fast comparison method [61] uses a vector representation of protein folds based on topological invariants called Gauss integrals, each representing a topological property of the backbone space curve [62]. Thirty such integrals are calculated for each of the two proteins, which are then compared in terms of the Euclidean distance between the resulting 30-dimensional vectors. A classifier built on Gauss integrals has a reported accuracy of 96.8% on the C, A and T classes of the CATH database [61].
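Once each structure is reduced to a 30-component vector, comparison and classification become simple vector operations. The sketch below assumes the Gauss integrals have already been computed elsewhere; the vectors shown are made-up placeholders, not real descriptor values, and `nearest_fold` is a hypothetical helper rather than the classifier used in [61].

```python
import math

def nearest_fold(query, reference):
    """Return (label, distance) of the reference vector closest to `query`
    under the Euclidean distance."""
    label, vec = min(reference.items(),
                     key=lambda item: math.dist(query, item[1]))
    return label, math.dist(query, vec)

# Placeholder 30-dimensional descriptors (hypothetical values).
reference = {"fold_A": [0.0] * 30, "fold_B": [1.0] * 30}
query = [0.1] * 30

label, distance = nearest_fold(query, reference)
```

Here the query is assigned to `fold_A`, the closer of the two reference vectors. A linear scan is enough for a sketch; a large database would normally call for a spatial index.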

3. Genomes, proteomes, networks

Designing representations for genomes, proteomes and networks is a real challenge, and we are only taking the first steps in this new era. The representations in current use follow the entity-relationship tradition; for example, genomes are represented as linear arrays of genes and other DNA segments. The entities (genes) are predicted with gene-prediction programs or determined by experimental methods, and this adds a new layer of knowledge to the molecular data. The relationships are manifold but predominantly binary in nature. Examples include physical vicinity, distance along the chromosome, regulatory links extracted from DNA chip data and so on. The resulting picture is a graph with several tens of thousands of nodes and relatively few edges per node, the edges denoting various relationships. The description of proteomes is only somewhat different. The proteins are described in functional, biochemical and structural terms, and the relationships between proteins include metabolic relationships (sharing substrates in metabolic pathways) as well as structural relationships (sequence and structural similarities). Even this sketchy
