categories, distributional classes) for recognizing and analyzing functional roles of
DNA fragments.
Here we briefly outline an approach under development, which we call
Infogenomics [20, 27], where we apply methods of informational text analysis
specifically conceived for genome analysis. Infogenomics aims at devising and
computing informational indexes able to provide a systematic “localization” of
genomes in multidimensional spaces where these indexes may vary. If these indexes
are defined adequately, then the characterization of genomes by means of them
could correspond to important biological or evolutionary properties that reflect
their internal organization, and could provide profiles for comparing genomes of
different species, or even of different physiological or pathological situations.
Bioinformatics has played a crucial role in the analysis of biopolymers by means
of the notions of sequence similarity and sequence alignment. In this field, many
algorithms on strings were essential in a huge number of important applications.
The epochal sequencing of complex genomes would surely have been impossible
without these algorithms and their efficient implementations. However, many
genomes, of different types of organisms, are now available as files on public
sites. The number of sequenced genomes is close to 1000, from bacteria to Homo
sapiens (without counting viruses). They are treasures, and an integrated analysis
of their informational, mathematical, and linguistic features could reveal new
clues in the challenge of understanding their languages. This perspective requires
a systemic approach where genomes are considered not only as strings, but as
structures based on strings, whose components and features could be discovered by
comparing them. This emerging perspective [35, 36, 34, 62] is based on
alignment-free methods of genome analysis, where global properties of genomes are
investigated, rather than local similarities based on classical methods of string
alignment.
On the side of molecular biology and biochemistry, many international projects
are active in deciphering genomes, in order to pass from the knowledge of genome
sequences to their biological functions. In particular, the ENCODE project
(ENCyclopedia Of DNA Elements) [23] mainly aims to extract lexicons and catalogs
of biochemically annotated DNA elements in the human genome. In this context,
biochemical functions were assigned to 80% of the genome, mainly outside the
protein-coding regions (with clear evidence of their crucial role in the
regulation of gene expression). A very complex dynamics of interactions emerges
among DNA regions, proteins, and RNAs, with many newly identified elements and a
huge amount of data (see websites: http://nature.com/encode,
http://epd.vital-it.ch). This scenario certainly has an informational basis,
linked to the DNA strings related to these elements. Therefore, an integration
of the biochemical and informational perspectives could provide important
synergies, with new possibilities for interpreting data and for discovering
principles of genome organization and function.
A simple argument can show the crucial role that dictionaries play in the
analysis of genomes. Let us consider a genome G with a length of $10^6$ bases. The
sequences of length 40 that we encounter by scanning the whole genome are
$(10^6 - 39)$, and possibly many of them occur many times. However, the number of
possible different words of length 40 over four letters is $4^{40}$, which is a value
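
The counting argument above can be checked with a few lines of Python. This is only an illustrative sketch: the helper name `kmer_counts` and the toy sequence are assumptions introduced here, not part of the Infogenomics method. It computes the number of length-40 windows in a sequence of $10^6$ bases, the number of possible words of length 40 over four letters, and the dictionary of a short toy string.

```python
from collections import Counter


def kmer_counts(genome: str, k: int) -> Counter:
    """Count every word of length k found by sliding a window one base at a time.

    Hypothetical helper for illustration; a sequence of length n yields
    n - k + 1 windows (zero if the sequence is shorter than k).
    """
    return Counter(genome[i:i + k] for i in range(len(genome) - k + 1))


# Figures from the text: a genome of 10^6 bases and words of length 40.
n, k = 10**6, 40
positions = n - k + 1   # number of windows of length k, i.e. 10^6 - 39
possible = 4**k         # number of distinct words of length k over {A, C, G, T}

# Toy sequence (an assumption, chosen only to keep the example small).
toy = "ACGTACGTAC"
toy_dictionary = kmer_counts(toy, 4)

if __name__ == "__main__":
    print("windows in the genome:", positions)
    print("possible words of length 40:", possible)
    print("toy dictionary:", dict(toy_dictionary))
```

The genome can therefore contain at most `positions` of the `possible` words. In the toy case, 7 windows yield only 4 distinct words because three of them occur twice, which is the kind of repetition the text refers to.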
 