5. It would be interesting to compute intersections of genomic dictionaries and to
investigate whether words that are common to many genomes are conserved along
evolutionary lineages.
6. The inter-genomic character of hapaxes and repeats could be investigated, to
determine which hapaxes (resp. repeats) of a given genome keep their status of
hapax (resp. repeat) within a given class of genomes.
7. New kinds of genomic representations could be investigated, which could be
useful for specific analyses of genomic features relevant in specific contexts.
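
Points 5 and 6 can be tried on toy data with a few lines of Python. The sketch below
assumes that the dictionary of a genome, for a fixed length k, is the set of its
overlapping words of length k, that a hapax is a word occurring exactly once, and that
a repeat is a word occurring more than once. The function names and the two toy
sequences are hypothetical and serve only as illustration.

```python
from collections import Counter

def kmer_counts(genome, k):
    """Count every overlapping word of length k in the genome."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

def common_words(genomes, k):
    """Words of length k that occur in every genome of a non-empty class."""
    dictionaries = [set(kmer_counts(g, k)) for g in genomes]
    return set.intersection(*dictionaries)

def conserved_hapaxes(genome, others, k):
    """Hapaxes of `genome` that are also hapaxes in every genome of `others`."""
    kept = {w for w, c in kmer_counts(genome, k).items() if c == 1}
    for other in others:
        counts = kmer_counts(other, k)
        kept = {w for w in kept if counts[w] == 1}
    return kept

# Toy sequences, invented for illustration only.
g1 = "ACGTACGA"
g2 = "TACGATTA"
print(sorted(common_words([g1, g2], 3)))       # ['ACG', 'CGA', 'TAC']
print(sorted(conserved_hapaxes(g1, [g2], 3)))  # ['CGA', 'TAC']
```

In this toy case the word ACG is a repeat in the first sequence and a hapax in the
second, so it changes status between the two.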
The last point of the list above is the basis for possible important developments.
Representing genomes in non-conventional ways could open new possibilities in
genome analysis. In fact, within a certain genomic representation some aspects
could emerge that are not evident in the conventional ways of expressing genomes.
An interesting issue, in this context, concerns genome compression methods. A
genome compression provides a compact way of identifying genome sequences, without
loss of information. Many approaches have been investigated, where classical
compression methods of information theory were applied and adapted to genomes [31].
In general, genomes are sequences that standard methods do not compress efficiently.
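
A rough way to see what "bits per base" means for a standard method is to compress a
sequence with a general-purpose compressor and divide the compressed size by the
number of bases. The sketch below uses Python's zlib module as the standard method.
The uniformly random sequence is only a stand-in for a sequence without exploitable
regularity, not a model of a real genome, and the helper name bits_per_base is
hypothetical. A four-letter alphabet needs at most 2 bits per base with a plain
fixed-length code, which is the baseline to compare against.

```python
import random
import zlib

def bits_per_base(genome):
    """Compressed size in bits divided by the number of bases."""
    compressed = zlib.compress(genome.encode("ascii"), 9)
    return 8 * len(compressed) / len(genome)

rng = random.Random(0)  # fixed seed so the sketch is repeatable
random_seq = "".join(rng.choice("ACGT") for _ in range(20000))
periodic_seq = "ACGT" * 5000  # same length, but highly regular

print(bits_per_base(random_seq))    # no real gain over the 2-bit baseline
print(bits_per_base(periodic_seq))  # far below 2 bits per base
```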
Now, let us assume we find an encoding that, within some class of genomes, is
able to provide efficient compressions, for example, by reducing a genome G of n
bases to a sequence g of n/10 bits (an incredible compression ratio compared with
actual genomic compressions, which are rarely below 1.7 bits per base in the
average case). Even if this encoding were hardly reversible, if injective, no loss of
information would occur in the passage from G to g. We call this kind of encoding
a one-way compression, because it reduces the digital information of G, but the
recovery of G from g is not efficient, or is even computationally hard, in the sense of
computational complexity (for example, NP-hard). Nevertheless, g could be very
important for representing the global identity of G in comparisons with other genomes
and, because of its reduced size, more adequate for some specific genomic analyses.
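
The injectivity condition can be checked directly on a finite class of genomes. In the
sketch below, which is only an illustration, a short cryptographic digest plays the
role of an encoding that is easy to compute and not meant to be inverted. This is an
assumption made for the example: a digest has a fixed length rather than n/10 bits,
and it is not the encoding hypothesized in the text. The names fingerprint and
is_injective_on and the toy class are hypothetical.

```python
import hashlib

def fingerprint(genome, size=8):
    """Stand-in encoding: a digest of `size` bytes, easy to compute, hard to invert."""
    return hashlib.blake2b(genome.encode("ascii"), digest_size=size).digest()

def is_injective_on(encode, genomes):
    """True if distinct genomes of the class always receive distinct codes."""
    distinct = set(genomes)
    return len({encode(g) for g in distinct}) == len(distinct)

# Toy class of sequences, invented for illustration only.
genome_class = ["ACGTACGA", "TACGATTA", "GGGCCCAT"]
print(is_injective_on(fingerprint, genome_class))  # codes identify the genomes
print(is_injective_on(len, genome_class))          # length alone does not
```

When the check succeeds, each code identifies exactly one genome of the class, so the
codes can replace the genomes in identity comparisons even though a genome cannot be
read back from its code.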
Another line of investigation concerns lossy compression methods. Let us
suppose that a genome G is represented by a shorter string g that does not identify
G, but that retains some specific feature of G in a form that is more evident, or
more easily detectable (because of the reduced size of g). If this is the case, the
compression method adopted for generating g could provide an important clue in
particular situations of genome classification.
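
As a minimal illustration of a lossy representation, a genome can be summarized by the
counts of its words of a fixed length k. This choice is an assumption made for the
example, not a method proposed in the text, and the name kmer_profile is hypothetical.
The summary has at most 4^k entries whatever the genome length, it keeps a
compositional feature of the genome, and it does not identify the genome, since
distinct sequences can share the same profile.

```python
from collections import Counter

def kmer_profile(genome, k=2):
    """Lossy summary: how many times each word of length k occurs."""
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))

# Two distinct toy sequences, invented for illustration only.
a = "AACA"
b = "ACAA"
print(kmer_profile(a) == kmer_profile(b))  # True: same profile
print(a == b)                              # False: different sequences
```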
It is too early for evaluating the biological interest of these approaches, but it
is reasonable to believe that new methods of genome representation could disclose
relevant aspects in the informational organization of genomes.
Examples of graphics related to the genomic distributions defined above are given
in Figs. 2.54-2.58.
Genomes cannot be fully considered as simple sequences but, more properly, as
texts in the usual sense of our written texts in natural or artificial languages. In fact,
what we call a text is very often only apparently a linear structure, because information
is organized at several levels that we distinguish by means of different kinds of
information which are added to the basic linear structure: different kinds of alphabets