introduction implies that we deal with a new kind of complexity that originates from the
numerous and, to a large extent, unknown interactions between the molecules. On the other
hand, the study of large networks (such as the Internet and social, road and electric
networks) has provided interesting insights that have been successfully applied to genomes,
proteomes and bibliographic networks.
From the computational point of view, genomes and proteomes are described as
very large graphs in which the nodes (genes, proteins) and the edges (relations) are
unknown or uncertain. These large and fuzzy descriptions are in sharp contrast with the
descriptions developed for well-defined molecular structures, but the methods are not
dissimilar to those used in other fields. Given the large and different genome sizes as well
as the uncertainties of the data, structured descriptions are not very useful for comparison.
Simple, unstructured representations like sets or vectors that can be easily compared in
terms of their known components are widely used. The approaches differ in how the
components are selected and compared. (Note that this section concentrates only on
those genome-comparison studies that use genome-level descriptions. For phylogenetic
approaches to genome comparison see [63,64].)
One group of approaches uses predefined components, given in the form of a
classification. Proteins can be classified into several thousand orthologous groups (COGs)
and a genome (proteome) can be described as a vector with a corresponding number of
components, each component denoting the presence or absence of a given protein group
[63]. This is an extremely simplified unstructured description, but the selected components
of the vector adequately describe the entire universe of protein functions as we know it
today. Two such vectors can then be compared using the Jaccard coefficient, and a related
distance measure (1 minus the coefficient) can be used as a metric for classifying the
genomes. This is a fast procedure that has no adjustable parameters; nevertheless, it gave
results in good agreement with other, more subjective methods. In a similar way, proteins
can be grouped according to their similarity to sequences representing known 3D folds, and
the genomes can then be classified in terms of these 3D folds [65,66].
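
The presence/absence comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure of [63]: the function names, the toy vectors and the convention for two all-absent vectors are assumptions made here.

```python
def jaccard_coefficient(a, b):
    """Jaccard coefficient of two equal-length presence/absence (0/1) vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    # Groups present in both genomes, and groups present in at least one.
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two all-absent vectors are treated as identical.
    return both / either if either else 1.0


def jaccard_distance(a, b):
    """Distance derived from the coefficient: 1 minus the Jaccard coefficient."""
    return 1.0 - jaccard_coefficient(a, b)


# Hypothetical toy data: five orthologous groups, three genomes.
genome_a = [1, 1, 0, 1, 0]
genome_b = [1, 0, 0, 1, 1]
genome_c = [0, 0, 1, 0, 1]

for name, other in (("b", genome_b), ("c", genome_c)):
    print("distance a to", name, "=", jaccard_distance(genome_a, other))
```

With real data each vector would have several thousand components, one per orthologous group, and the resulting pairwise distances could be passed to any standard clustering method.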
Metabolic data are a further example of predefined component classification that
can be used for genome comparison [67,68]. A proteome can be described in terms of its
constituent enzymes, substrates and intermediate complexes, such as those given in the WIT
database [69]. Organism data can then be converted into vectors representing enzymes or
substrates in pathways or pathway-groups. For example, in a vector representing the
metabolic pathways of E. coli in terms of enzymes, a component e_i is an integer denoting the
number of times enzyme i occurs in the metabolic pathways of the organism. Such vectors
can then be compared using any of the vector similarity/distance measures described above.
A classification of genomes based on vectors representing the metabolic and information-
processing pathways in terms of enzymes and substrates has shown that the system-level
organization of Archaea and Eukarya is similar [70]. This comparison was based partly on
presence/absence data and the Jaccard coefficient, and partly on comparing the ranking data
of component frequencies.
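
As an illustration of the enzyme-count vectors described above, the following sketch builds a vector whose component e_i counts how often enzyme i occurs across an organism's pathways, and compares two such vectors. The pathway and enzyme names are hypothetical placeholders, and cosine similarity is only one assumed choice among the possible vector measures; the cited studies may have used others.

```python
from collections import Counter
from math import sqrt


def enzyme_count_vector(pathways, enzyme_index):
    """Return [e_1, ..., e_n], where e_i counts occurrences of enzyme i
    across all pathways of one organism."""
    counts = Counter(
        enzyme for enzymes in pathways.values() for enzyme in enzymes
    )
    # Counter returns 0 for enzymes that never occur.
    return [counts[enzyme] for enzyme in enzyme_index]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sqrt(sum(x * x for x in u))
    norm_v = sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical placeholder data, not taken from any database.
enzyme_index = ["enzyme_A", "enzyme_B", "enzyme_C"]
organism_x = {
    "pathway_1": ["enzyme_B", "enzyme_C"],
    "pathway_2": ["enzyme_B", "enzyme_A"],
}
organism_y = {
    "pathway_1": ["enzyme_B"],
    "pathway_2": ["enzyme_A", "enzyme_A"],
}

vec_x = enzyme_count_vector(organism_x, enzyme_index)
vec_y = enzyme_count_vector(organism_y, enzyme_index)
print(vec_x, vec_y, cosine_similarity(vec_x, vec_y))
```

Fixing the order of `enzyme_index` once for all organisms is what makes the vectors comparable component by component.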
Another group of representations uses sequence comparison to dynamically define
matching components between two genomes. The matching pairs of genes can be selected
based on BLAST scores [71] or Smith-Waterman scores [72,73]. The intergenomic distances
can then be based on the list of shared (as well as total) components present in two
genomes, using e.g. the Jaccard coefficient. A particularly interesting version of this
method uses vicinal gene-pairs with conserved direction of transcription, identified from
Smith-Waterman searches [74-76]. Given the matching vicinal pairs in the two genomes as
substructures, one can proceed in the usual way. This method thus preserves the speed of
the comparison but uses substructures that are richer in detail, i.e. they capture a part of the gene
order.
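
The vicinal gene-pair idea can be sketched as follows. This is a simplified illustration under several assumptions that are not from the source: matching genes are taken to be already mapped to shared labels (for example by a prior sequence search), "vicinal" is read as immediately adjacent, and "conserved direction of transcription" is read as both genes lying on the same strand. The function names and toy genomes are hypothetical.

```python
def vicinal_pairs(genome):
    """Set of adjacent gene pairs transcribed in the same direction.

    `genome` is an ordered list of (label, strand) tuples, with strand
    '+' or '-'. Pairs on the minus strand are stored in transcription
    order so that the same pair is recognised on either strand.
    """
    pairs = set()
    for (g1, s1), (g2, s2) in zip(genome, genome[1:]):
        if s1 != s2:
            continue  # neighbours transcribed in opposite directions
        pairs.add((g1, g2) if s1 == "+" else (g2, g1))
    return pairs


def jaccard_distance_sets(a, b):
    """1 minus the Jaccard coefficient of two sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy genomes; labels stand for matched genes.
genome_1 = [("g1", "+"), ("g2", "+"), ("g3", "-"), ("g4", "-"), ("g5", "+")]
genome_2 = [("g4", "+"), ("g3", "+"), ("g1", "+"), ("g2", "+"), ("g6", "-")]

pairs_1 = vicinal_pairs(genome_1)
pairs_2 = vicinal_pairs(genome_2)
print(sorted(pairs_1 & pairs_2), jaccard_distance_sets(pairs_1, pairs_2))
```

Because the gene pairs are treated as plain set elements, the comparison stays as cheap as the single-gene version while still reflecting local gene order.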