Strings and Genomes - Infobiotics: Information in Biotic Systems

organization levels are put in a common unidimensional structure. This is surely the
case of genomic texts; therefore the examples help to imagine the kinds of difficulties
faced in disclosing the internal logic of genomes.

Figures 2.59, 2.60, 2.61, 2.62, 2.63, and 2.64 are visual representations of the
genome of E. coli. The genome of this organism (a strain different from the more
common K12) is 5,528,445 base pairs long. Colors denote bases according to the
following code: A green, C blue, G yellow, and T red. The size of edges (or sectors, or
circles) is proportional to the multiplicity of paths passing through them. Moreover,
in all these visualizations we can focus on different parts of the structure by enlarging
the view of some portions (Figs. 2.60, 2.62, and 2.64 focus on parts of Figs. 2.59, 2.61,
and 2.63, respectively). The trees represented in these figures (with different
visualization methods, see [40]) are obtained by means of an algorithm called “Crescendo”,
which is based on the elongation of initial seeds (in our case, fragments of length 6).
The seeds are chosen among the most frequent 6-mer words occurring in the genome,
plus the initial 6-mer word, or overlapping concatenations of these words occurring
in the genome. In the cases shown in the figures, the seeds are (multiplicities in
brackets):

AGCTTTCTGGCG [1], CTGGCG [6032], CTGGCGCTGGCG [22], CTGGCGCTGGCGCTGGCG [1], AGCTTT [1280].

These words were elongated by starting from each of them and adding, in order, all
the bases that follow the seed in the genome, stopping the elongation when the
beginning of any seed is encountered. In this way the concatenation of these
elongations covers the whole genome. There are 7208 distinct elongation fragments,
which occur 7336 times in total; therefore the majority of them are hapaxes (at most
7336 - 7208 = 128 distinct fragments can occur more than once). For a better
visualization, the trees represented in the figures end when the number of possible
continuations is less than 30. These figures suggest many possible analyses, but
what we want to remark here is that they suggest perspectives of genome analysis
that are probably hidden when we represent genomes in classical manners.
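The seed selection and elongation steps described above can be illustrated with a minimal Python sketch. It is not the Crescendo implementation: the function names (`count_kmers`, `seed_at`, `elongation_fragments`) are hypothetical, the genome is a toy string, and two details are assumptions, namely that the longest seed is matched at each fragment start and that the search for the next seed begins only after the current seed ends. The second assumption is at least consistent with the multiplicities listed above, which sum to 7336, the total number of fragment occurrences.

```python
from collections import Counter


def count_kmers(genome, k=6):
    """Count every overlapping k-mer (word of length k) in the genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


def seed_at(genome, pos, seeds_longest_first):
    """Return the longest seed starting at `pos`, or None if no seed starts there."""
    for seed in seeds_longest_first:
        if genome.startswith(seed, pos):
            return seed
    return None


def elongation_fragments(genome, seeds):
    """Cut the genome into fragments: a seed plus the bases that follow it,
    up to (but excluding) the next position where any seed begins."""
    ordered = sorted(seeds, key=len, reverse=True)  # assumption: longest match wins
    fragments = []
    start = 0
    while start < len(genome):
        seed = seed_at(genome, start, ordered)
        if seed is None:
            # Only possible at position 0: the initial word must be a seed.
            raise ValueError("the genome must begin with one of the seeds")
        end = start + len(seed)
        while end < len(genome) and seed_at(genome, end, ordered) is None:
            end += 1
        fragments.append(genome[start:end])
        start = end
    return fragments


# Toy data (not taken from the E. coli genome); k = 3 keeps the example readable.
genome = "ACGTTTACGACGTTACGG"
counts = count_kmers(genome, k=3)
most_frequent = counts.most_common(1)[0][0]
seeds = {most_frequent, genome[:3]}  # most frequent word plus the initial word
fragments = elongation_fragments(genome, seeds)
fragment_counts = Counter(fragments)
hapaxes = [f for f, n in fragment_counts.items() if n == 1]

print(fragments)                     # ['ACGTTT', 'ACG', 'ACGTT', 'ACGG']
print("".join(fragments) == genome)  # True: the fragments cover the whole genome
print(len(fragment_counts), len(hapaxes))
```

Including the initial word among the seeds is what guarantees that the first fragment starts at position 0, so that the concatenation of the fragments reproduces the genome.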

Genome visualization is not only a matter of presentation. In fact, when colors,
forms, and geometrical shapes are used in visualizations, we use many-dimensional
perspectives, which prove to be more adequate for expressing systemic phenomena.
As shown in Tables 2.19, 2.20, 2.21, 2.22, and 2.23, the internal organization
of texts is hardly recognized in their sequential representations, because it is very
often based on many representation levels, and this aspect is what makes a text
different from a string (see Tables 2.21, 2.22, 2.23). An interesting phenomenon,
common to the genome representations based on Crescendo, is their noise amplification.
Namely, if a small percentage of a genome is changed, for example by randomly
substituting some bases, then the corresponding (Crescendo) representations prove
to be different (with respect to reasonable metrics) by a percentage that is between 3
and 10 times the original variation percentage. This effect was confirmed by many
experiments, and can be explained by the additional structure of the representation,
which becomes more sensitive to small variations in the linear structure of the genome.
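An experiment of this kind might be set up along the following lines. The sketch is only illustrative: it uses a synthetic random sequence rather than a real genome, a single arbitrary seed, a simple seed-based segmentation as a stand-in for the Crescendo representation, and a multiset comparison of fragments as a stand-in for the "reasonable metrics" mentioned above. All function names are hypothetical, and the sketch makes no attempt to reproduce the 3 to 10 times factor.

```python
import random
from collections import Counter


def substitute_bases(genome, rate, rng):
    """Replace each base, with probability `rate`, by a different random base."""
    out = []
    for base in genome:
        if rng.random() < rate:
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def split_at_seed(genome, seed):
    """Stand-in representation: cut the genome at each non-overlapping seed start."""
    bounds = [0]
    pos = genome.find(seed)
    while pos != -1:
        if pos > 0:
            bounds.append(pos)
        pos = genome.find(seed, pos + len(seed))
    bounds.append(len(genome))
    return [genome[a:b] for a, b in zip(bounds, bounds[1:])]


def sequence_change(a, b):
    """Fraction of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def fragment_change(frags_a, frags_b):
    """Stand-in metric: share of fragment occurrences unmatched in the other multiset."""
    ca, cb = Counter(frags_a), Counter(frags_b)
    unmatched = sum((ca - cb).values()) + sum((cb - ca).values())
    return unmatched / (sum(ca.values()) + sum(cb.values()))


rng = random.Random(0)
original = "".join(rng.choice("ACGT") for _ in range(20000))  # synthetic, not a real genome
mutated = substitute_bases(original, rate=0.01, rng=rng)
seed = "ACG"  # arbitrary illustrative seed

seq_delta = sequence_change(original, mutated)
rep_delta = fragment_change(split_at_seed(original, seed), split_at_seed(mutated, seed))
print(f"sequence change: {seq_delta:.4f}, representation change: {rep_delta:.4f}")
```

The intuition behind the comparison is that a single substituted base alters an entire fragment, and possibly its neighbours if a seed occurrence is created or destroyed, so the change measured on the structured representation can exceed the change measured on the linear sequence.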
