Strings and Genomes - Infobiotics: Information in Biotic Systems


greater than $10^{24}$. This means that the sequences of length 40 which occur in that genome amount to a fraction of about $10^{-18}$ of the whole set of possible words. This simple numerical evaluation tells us that sequences of length 40 must be meaningful: if they occur, they were selected among a huge number of other possible sequences. The real sequences of this length which we meet in a genome are a small part of those which were evaluated during the evolutionary process. In other words, the meaning of words is related to the relationship between the real words and all possible words.
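The orders of magnitude in this estimate can be checked with a few lines of exact integer arithmetic. The sketch below is only illustrative: the genome length `genome_length` is a hypothetical value chosen for the example and is not taken from the text.

```python
# Number of possible words of length 40 over a four-letter alphabet.
ALPHABET_SIZE = 4
WORD_LENGTH = 40
total_words = ALPHABET_SIZE ** WORD_LENGTH  # exact, Python integers are unbounded

# Hypothetical genome length (an assumption for illustration only).
genome_length = 10 ** 6

# A sequence of length n has n - k + 1 starting positions for a word of
# length k, so it contains at most that many distinct words of length k.
max_observed_words = genome_length - WORD_LENGTH + 1

# Upper bound on the fraction of all possible 40-letter words that can occur.
fraction = max_observed_words / total_words

print(f"possible words of length 40: {total_words:.3e}")
print(f"upper bound on the occurring fraction: {fraction:.1e}")
```

Under this assumed length, the occurring words are bounded by a fraction below $10^{-18}$ of all possible words, which agrees with the estimate in the text.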

Let us denote by $\Gamma$ the genomic alphabet of four symbols (characters, or letters, associated with nucleotides): $\Gamma = \{A, T, C, G\}$ (then $\Gamma^*$, as usual, denotes the set of all possible words over $\Gamma$). A genome G is representable by a sequence over $\Gamma$, that is, a table assigning a symbol of $\Gamma$ to each position (from 1 to the length of G). Symbols are written in a linear order, from left to right, according to the standard writing system of Western languages, and to the chemical orientation $5'$-$3'$ of DNA molecules.

We remark that other equivalent representations of sequences are possible. For example, we could represent G by a function associating to each symbol of $\Gamma$ the set of positions where it occurs. In this way G is identified by four sets of numbers, say $N(A)$, $N(T)$, $N(C)$, $N(G)$. It is also important to distinguish between subsequences and substrings (also called words, factors, or k-mers) of G. Indeed, a subsequence is a sequence of symbols occurring on a set of positions (considered in their order), while a substring is a subsequence of symbols which are (contiguously) associated to all the positions between an initial and a final position (of course, any string is also a sequence). If a genome has length n, then according to the Gauss triangular formula, it has at most $n(n+1)/2$ different factors (n factors of length 1, $n-1$ of length 2, and so on, up to only one factor of length n), while all the possible subsequences are $2^n$ (the different ways of choosing sets of positions).
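The position-set representation and the difference between factors and subsequences can be made concrete on a toy sequence. The following is a minimal sketch. The names `position_sets`, `factors` and `is_subsequence`, and the example genome, are hypothetical choices made for illustration.

```python
GAMMA = "ATCG"  # the four-symbol genomic alphabet


def position_sets(genome: str) -> dict[str, set[int]]:
    """Map each symbol to the set of (1-based) positions where it occurs."""
    sets: dict[str, set[int]] = {symbol: set() for symbol in GAMMA}
    for position, symbol in enumerate(genome, start=1):
        sets[symbol].add(position)
    return sets


def factors(genome: str) -> set[str]:
    """Return all distinct non-empty substrings (factors) of the genome."""
    n = len(genome)
    return {genome[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def is_subsequence(candidate: str, genome: str) -> bool:
    """Check whether candidate occurs on some increasing set of positions."""
    remaining = iter(genome)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so successive symbols are searched strictly to the right.
    return all(symbol in remaining for symbol in candidate)


genome = "ACGTAC"  # toy example, not a real genome
n = len(genome)
N = position_sets(genome)
distinct_factors = factors(genome)

print(N)
print(len(distinct_factors), "distinct factors, bound:", n * (n + 1) // 2)
print(2 ** n, "ways of choosing a set of positions")
print(is_subsequence("AGC", genome), "AGC" in distinct_factors)
```

In this example the word AGC is a subsequence (positions 1, 3 and 6) but not a factor, since those positions are not contiguous. The number of distinct factors stays below the triangular bound because some factors, such as AC, occur more than once.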

A dictionary D of a genome G is a factorization of G when the concatenation of all the elements of D, possibly with overlapping substrings, yields G (the overlapping concatenation of $\alpha\gamma$ with $\gamma\beta$ is $\alpha\gamma\beta$). It is intended that in this concatenation the elements of D occur at least once, but possibly more than once. We remark that the problem of genome sequencing can be expressed in the following way. Given a genomic dictionary D (consisting of words of G, called reads, usually of average length under 1000 bp), find the most probable genome G such that D is a factorization of G, and where the concatenation of elements is always a proper overlapping concatenation. Despite this simple formulation, this problem is computationally complex and its solution is not uniquely defined in mathematical terms, but it can be found, with a certain probabilistic belief (supported by empirical evidence), by means of different and repeated reconstruction experiments of G from different factorizations of it. Nowadays, many different sequencing methods are available, which are based on different technologies. Crucial parameters of these sequencing methods are the average length of reads and the number of hierarchical phases of string assembling (where fragments of increasing lengths are reconstructed from factorizations of these fragments).
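The overlapping concatenation used in the definition of a factorization can be sketched as follows. This is an illustration under strong simplifying assumptions, not a sequencing algorithm: the reads are given already in their left-to-right order, they contain no errors, and the longest shared overlap is always taken as $\gamma$. The function name `overlap_concat` and the toy reads are hypothetical.

```python
from functools import reduce


def overlap_concat(left: str, right: str, min_overlap: int = 1) -> str:
    """Merge left = alpha+gamma and right = gamma+beta into alpha+gamma+beta.

    The longest gamma that is both a suffix of `left` and a prefix of
    `right` is used (an assumption: other choices of gamma may exist).
    Raises ValueError if no overlap of at least `min_overlap` symbols exists.
    """
    longest_possible = min(len(left), len(right))
    for k in range(longest_possible, min_overlap - 1, -1):
        if left.endswith(right[:k]):
            return left + right[k:]
    raise ValueError(f"no overlap of length >= {min_overlap}")


# Toy dictionary of reads, listed in the order in which they occur.
reads = ["ACGT", "GTTAG", "AGC"]

# Fold the reads from left to right with the overlapping concatenation.
reconstructed = reduce(overlap_concat, reads)
print(reconstructed)
```

In a real sequencing setting neither the order of the reads nor the correct overlaps are known in advance, which is one reason why, as noted above, the reconstruction is computationally hard and not uniquely defined.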
