Information Technology Reference
In-Depth Information
organization levels are put in a common unidimensional structure. This is surely the
case of genomic texts; therefore the examples help to imagine the kinds of difficul-
ties faced in disclosing the internal logic of genomes.
Figures 2.59, 2.60, 2.61, 2.62, 2.63, and 2.64 are visual representations of the
genome of E. coli. The genome of this organism (a strain different from the more
common K12) is 5,528,445 base pairs long. Colors denote bases according to the
following code: A green, C blue, G yellow, and T red. The size of edges (or sectors, or
circles) is proportional to the multiplicity of paths passing through them. Moreover,
in all these visualizations we can focus on different parts of the structure by enlarging
the view of some portions (Figs. 2.60, 2.62, and 2.64 focus on parts of Figs. 2.59, 2.61,
and 2.63, respectively). The trees represented in these figures (with different
visualization methods, see [40]) are obtained by means of an algorithm called “Crescendo”,
which is based on the elongation of initial seeds. In our case the seeds are the initial
6-mer word of the genome, words chosen among the most frequent 6-mers occurring in the
genome, and overlapping concatenations of these words that occur in the genome. In the
cases shown in the figures the seeds are (multiplicities are given in brackets):
AGCTTTCTGGCG [1], CTGGCG [6032], CTGGCGCTGGCG [22], CTGGCGCTGGCGCTGGCG [1], AGCTTT [1280].
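The choice of seeds among the most frequent 6-mers can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: the function names `kmer_counts` and `candidate_seeds`, the `top` parameter, and the rule "initial word plus the `top` most frequent words" are hypothetical, and occurrences are counted with overlaps, which may differ from the way the multiplicities above were obtained.

```python
from collections import Counter


def kmer_counts(genome: str, k: int = 6) -> Counter:
    """Count every (possibly overlapping) k-mer of a linear genome string."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def candidate_seeds(genome: str, k: int = 6, top: int = 1) -> list:
    """Return the initial k-mer followed by the `top` most frequent k-mers.

    Assumption: this selection rule is a simplification for illustration;
    the text only says that seeds are chosen among the most frequent 6-mers.
    """
    seeds = [genome[:k]]
    for word, _ in kmer_counts(genome, k).most_common(top):
        if word not in seeds:
            seeds.append(word)
    return seeds


# Toy string, not real genome data.
toy = "AGCTTTCTGGCGAACTGGCGTT"
print(kmer_counts(toy)["CTGGCG"])
print(candidate_seeds(toy))
```

On a real genome the same counting would be applied to the full sequence, and longer seeds would be built by checking which concatenations of the selected words actually occur.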
These words were elongated by starting from each of them and adding, in order, all the
bases that follow the seed in the genome, stopping the elongation when the beginning of
any seed is encountered. In this way the concatenation of these elongations covers the
whole genome. There are 7208 distinct elongation fragments, which occur 7336 times in
total; the majority of them are therefore hapaxes. For a better visualization, the trees
represented in the figures end when the number of possible continuations is less than 30.
These figures suggest many possible analyses, but what we want to remark here is that they
suggest perspectives on genome analysis that are probably hidden when we represent genomes
in classical manners.
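The elongation step described above can be sketched in Python as follows. This is a minimal reading of the description, not the Crescendo implementation itself: the function name `elongation_fragments` is hypothetical, and two details are assumptions, namely that the longest matching seed is preferred at each position and that the search for the next seed resumes after the end of the current one.

```python
from collections import Counter
from typing import List, Optional


def elongation_fragments(genome: str, seeds: List[str]) -> List[str]:
    """Split `genome` into consecutive fragments, each starting with a seed.

    Assumptions (not stated in the text): the longest matching seed wins,
    and scanning for the next seed starts after the current seed ends.
    """
    by_length = sorted(seeds, key=len, reverse=True)

    def seed_at(pos: int) -> Optional[str]:
        # Return the longest seed that begins at `pos`, if any.
        for s in by_length:
            if genome.startswith(s, pos):
                return s
        return None

    first = seed_at(0)
    if first is None:
        raise ValueError("the genome must begin with one of the seeds")

    fragments = []
    start, pos = 0, len(first)
    while pos < len(genome):
        s = seed_at(pos)
        if s is not None:
            # The beginning of a seed closes the current elongation.
            fragments.append(genome[start:pos])
            start, pos = pos, pos + len(s)
        else:
            pos += 1
    fragments.append(genome[start:])
    return fragments


# Toy string, not real genome data.
toy = "AGCTTTAACTGGCGTTCTGGCGCTGGCGAAAGCTTTAA"
frags = elongation_fragments(toy, ["AGCTTT", "CTGGCG", "CTGGCGCTGGCG"])
counts = Counter(frags)
hapaxes = [f for f, n in counts.items() if n == 1]
print(frags, len(counts), len(hapaxes))
```

Because every fragment ends exactly where the next one begins, joining the fragments in order gives back the input string, which mirrors the remark that the elongations cover the whole genome.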
Genome visualization is not only a matter of presentation. In fact, when colors,
forms, and geometrical shapes are used in visualizations, we adopt many-dimensional
perspectives, which prove to be more adequate for expressing systemic phenomena. As shown
in Tables 2.19, 2.20, 2.21, 2.22, and 2.23, the internal organization of texts is hard to
recognize in their sequential representations, because it is very often based on many
representation levels, and this aspect is what makes a text different from a string (see
Tables 2.21, 2.22, 2.23). An interesting phenomenon, common to the genome representations
based on Crescendo, is their noise amplification. Namely, if a small percentage of a
genome is changed, for example by randomly substituting some bases, then the corresponding
(Crescendo) representations prove to differ (with respect to reasonable metrics) by a
percentage that is between 3 and 10 times the original variation percentage. This effect
was confirmed by many experiments, and can be explained by the additional structure of the
representation, which becomes more sensitive to small variations in the linear structure
of the genome.
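A noise-amplification experiment of this kind could be set up along the following lines. The sketch is hedged throughout: the text does not specify the metric, so the Jaccard distance used here is only one possible choice, and `substitute`, `jaccard_distance`, `amplification`, and the `represent` callable (any function mapping a genome string to a set of features, such as its set of elongation fragments) are hypothetical names introduced for illustration.

```python
import random

BASES = "ACGT"


def substitute(genome: str, rate: float, rng: random.Random) -> str:
    """Replace a fraction `rate` of positions, each with a different base."""
    n = round(rate * len(genome))
    chars = list(genome)
    for i in rng.sample(range(len(genome)), n):
        chars[i] = rng.choice([b for b in BASES if b != chars[i]])
    return "".join(chars)


def jaccard_distance(a: set, b: set) -> float:
    """One possible metric between two feature sets (an assumption)."""
    union = a | b
    return 1 - len(a & b) / len(union) if union else 0.0


def amplification(genome: str, rate: float, represent, rng: random.Random) -> float:
    """Ratio between the change in the representation and the change in the genome.

    `represent` is a hypothetical callable returning a set of features;
    `rate` must be greater than zero.
    """
    mutated = substitute(genome, rate, rng)
    return jaccard_distance(represent(genome), represent(mutated)) / rate


# Stand-in representation for the example only: the set of 6-mers.
def six_mers(s: str) -> set:
    return {s[i:i + 6] for i in range(len(s) - 5)}


rng = random.Random(0)
toy = "".join(rng.choice(BASES) for _ in range(2000))  # random toy string
print(amplification(toy, 0.01, six_mers, rng))
```

A ratio above 1 would indicate that the representation changes proportionally more than the underlying sequence; the values obtained with this stand-in representation and metric are not expected to reproduce the 3 to 10 range reported for Crescendo.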