Mitchel [57] developed a graph-like representation that uses secondary structural
elements (SSEs) as nodes and angles and distances as edges; the largest common substructure
was then identified by the subgraph isomorphism algorithm developed by Grindley et al. [58].
Harrison and associates [59] further developed this approach and introduced a similarity
measure, S_grath, based on the number of SSEs and residues in the two proteins and in the
largest common subgraph:
[Equation 9: S_grath expressed in terms of CS, SS1, SS2, Min(R1, R2), Max(R1, R2), CR1, CR2, R1 and R2]
where the two proteins compared have SS1 and SS2 secondary structures and R1 and R2
amino acid residues, respectively, and their comparison generates a largest clique of size CS.
The largest clique is produced from a set of secondary structures that contain a total of CR1
and CR2 residues in protein 1 and protein 2, respectively. This similarity measure is
reported to be independent of fold size and was used to characterize the fold space
represented by the CATH database [59].
Finally, there are methods that do not use superposition but define simple similarity
scores instead. PRIDE [60] is based on the distribution of intramolecular Cα-Cα distances
incorporated into a set of histograms for Cα pairs separated by 3 to 30 residues. The
comparison between two proteins is thus reduced to the comparison of 28 distribution pairs,
which can be carried out by a standard statistical method, contingency table analysis, and
yields a probability value for each pair. The average of these 28 single similarity scores was
defined as the Probability of identity, or PRIDE score [60]. PRIDE has a value between 0 and
1 and has metric properties, which make it suitable for clustering large datasets. The
calculation is extremely fast (perhaps the fastest available today), so that database search,
fold assignment and clustering of structures are possible online. When used as a simple
nearest neighbor classifier, PRIDE reaches 99.5% success in fold recognition, based on the
C, A, T and H classes of the CATH database. This method is available via a web server at ICGEB.
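The histogram comparison described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the published PRIDE implementation: the bin width (1 Å), the number of bins (60), the chi-square test of homogeneity used as the contingency table analysis, and all function names (`distance_histogram`, `histogram_pvalue`, `pride_like_score`) are choices made here for the example. Each protein is assumed to be given as a list of Cα coordinates in chain order.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    y = x / 2.0
    if y <= 0.0:
        return 1.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)              # Q(1, y) = exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))   # Q(1/2, y) = erfc(sqrt(y))
    # Recurrence: Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1)
    while a < df / 2.0:
        q += math.exp(a * math.log(y) - y - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)

def distance_histogram(ca, sep, bin_width=1.0, n_bins=60):
    """Histogram of Ca-Ca distances for residue pairs (i, i + sep).

    bin_width and n_bins are assumptions of this sketch; distances beyond
    the last bin edge are counted in the last bin.
    """
    counts = [0] * n_bins
    for i in range(len(ca) - sep):
        d = math.dist(ca[i], ca[i + sep])
        counts[min(int(d / bin_width), n_bins - 1)] += 1
    return counts

def histogram_pvalue(h1, h2):
    """Chi-square test of homogeneity on the 2 x B contingency table."""
    cols = [(a, b) for a, b in zip(h1, h2) if a + b > 0]
    n1 = sum(a for a, _ in cols)
    n2 = sum(b for _, b in cols)
    if n1 == 0 or n2 == 0:
        raise ValueError("chain too short for this sequence separation")
    if len(cols) < 2:
        return 1.0  # both distributions fall into a single bin
    total = n1 + n2
    chi2 = 0.0
    for a, b in cols:
        for observed, row_total in ((a, n1), (b, n2)):
            expected = row_total * (a + b) / total
            chi2 += (observed - expected) ** 2 / expected
    return chi2_sf(chi2, len(cols) - 1)

def pride_like_score(ca1, ca2, seps=range(3, 31)):
    """Average of the 28 per-separation probabilities (separations 3..30)."""
    pvals = [
        histogram_pvalue(distance_histogram(ca1, s), distance_histogram(ca2, s))
        for s in seps
    ]
    return sum(pvals) / len(pvals)
```

Because only per-protein histograms are compared, no superposition or alignment step is needed, which is where the speed of this family of methods comes from. Both chains must have more than 30 residues for all 28 separations to be defined.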
Another recent, fast comparison method [61] uses a vector representation of
protein folds based on topological invariants called Gauss integrals, each
representing a topological property of the backbone space curve [62]. Thirty such integrals
are calculated for each of the two proteins, which are then compared in terms of a
30-dimensional Euclidean distance. A classifier built on Gauss integrals has a reported
accuracy of 96.8% on the C, A, T classes of the CATH database [61].
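Once each structure is reduced to a fixed-length vector, comparison becomes a plain distance computation. The sketch below assumes the 30 Gauss integrals have already been computed for every protein (their calculation is not shown) and uses a nearest-neighbor rule for classification; the source does not say which classifier was used in [61], so the rule and the names `euclidean_distance` and `nearest_fold` are assumptions made for illustration.

```python
import math

def euclidean_distance(u, v):
    """Euclidean distance between two descriptor vectors of equal length."""
    if len(u) != len(v):
        raise ValueError("descriptor vectors must have the same length")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def nearest_fold(query, labelled_vectors):
    """Return the label of the reference vector closest to the query.

    labelled_vectors is a list of (label, vector) pairs; the nearest-neighbor
    rule is an assumption of this sketch, not the published classifier.
    """
    label, _ = min(
        labelled_vectors,
        key=lambda item: euclidean_distance(query, item[1]),
    )
    return label

# Hypothetical 30-dimensional descriptors for two reference folds.
references = [("fold_a", [0.0] * 30), ("fold_b", [1.0] * 30)]
query = [0.1] * 30
assigned = nearest_fold(query, references)
```

The cost of one comparison is independent of chain length, since every protein is represented by the same 30 numbers.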
3. Genomes, proteomes, networks
Designing representations for genomes, proteomes and networks is a real challenge, and we
are only at the first steps of this new era. The representations in current use follow the
entity-relationship tradition; for example, genomes are represented as linear arrays of genes
and other DNA segments. The entities (genes) are predicted with gene-prediction
programs or are determined by experimental methods, and this adds a new layer of knowledge
to the molecular data. The relationships are manifold but are predominantly binary in
nature. Examples of relations include physical vicinity, distance along the chromosome,
regulatory links extracted from DNA chip data and so on. The resulting picture is a graph of
several tens of thousands of nodes and relatively few edges per node denoting various
relationships. The description of proteomes is only somewhat different. The proteins are
described in functional, biochemical and structural terms, and the relationships between
proteins include metabolic relationships (sharing substrates in metabolic pathways) as well
as structural relationships (sequence and structural similarities). Even this sketchy