greater than 10^24. This means that the set of sequences of length 40 which occur in
that genome is about 10^18 times smaller than the whole set of possible words. This
simple numerical evaluation tells us that sequences of length 40 have to be
meaningful. In fact, if they occur, they were surely selected among a huge number
of other possible sequences. The real sequences of this length which we meet in a
genome are a small part of those which were evaluated during the evolutionary process.
In other words, the meaning of words is related to the relationship between the real
words and all possible words.
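The order of magnitude quoted above can be checked with a few lines of arithmetic. The sketch below is only illustrative: the genome length of 10^6 is an assumption (this passage does not state it), chosen because it is consistent with the ratio of about 10^18, and the function names are hypothetical.

```python
K = 40                              # word length discussed in the text
ASSUMED_GENOME_LENGTH = 10**6       # assumption: not stated in this passage


def possible_words(k: int) -> int:
    """Number of words of length k over a four-letter alphabet."""
    return 4**k


def max_occurring_words(genome_length: int, k: int) -> int:
    """Upper bound on distinct k-mers in a genome: one per start position."""
    return max(genome_length - k + 1, 0)


total = possible_words(K)
occurring = max_occurring_words(ASSUMED_GENOME_LENGTH, K)
ratio = total // occurring

print(f"possible 40-mers: {total}")
print(f"at most {occurring} of them can occur; ratio is about {ratio:.1e}")
```

Under this assumption, the words that actually occur are at most one part in roughly 10^18 of all possible words, which is the gap the argument relies on.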
Let us denote by Γ the genomic alphabet of four symbols (characters, or letters,
associated with nucleotides): Γ = {A, T, C, G} (then Γ*, as usual, denotes the set of
all possible words over Γ).
A genome G is representable by a sequence over Γ, that is, a table assigning a
symbol of Γ to each position (from 1 to the length of G). Symbols are written in a
linear order, from left to right, according to the standard writing system of western
languages, and to the chemical orientation 5′-3′ of DNA molecules.
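A minimal sketch of this representation, assuming a genome is stored as a Python string read left to right. The names GAMMA, is_word and symbol_at are hypothetical and introduced only for illustration.

```python
GAMMA = frozenset("ATCG")  # the genomic alphabet


def is_word(w: str) -> bool:
    """True if w is a word over the alphabet, i.e. every symbol of w is in GAMMA.

    The empty word is accepted, as is usual for the set of all words.
    """
    return all(symbol in GAMMA for symbol in w)


def symbol_at(genome: str, position: int) -> str:
    """Symbol assigned to a 1-based position (1 up to the length of the genome)."""
    if not 1 <= position <= len(genome):
        raise IndexError(f"position {position} outside 1..{len(genome)}")
    return genome[position - 1]  # Python strings are 0-based


genome = "ATCGGA"
print(is_word(genome), symbol_at(genome, 1), symbol_at(genome, len(genome)))
```

The only design point is the shift between the 1-based positions used in the text and the 0-based indexing of Python strings.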
We remark that other equivalent representations of sequences are possible. For
example, we could represent G by a function associating to each symbol of Γ the set
of positions where it occurs. In this way G is identified by four sets of numbers, say
N(A), N(T), N(C), N(G). It is also important to distinguish between subsequences
and substrings (also called words, factors, k-mers) of G. Indeed, a subsequence
is a sequence of symbols occurring on a set of positions (considered in their order),
while a substring is a subsequence of symbols which are (contiguously) associated
to all the positions between an initial and a final position (of course, any string is
also a sequence). If a genome has length n, then according to the Gauss triangular
formula, it has at most n(n+1)/2 different factors (n factors of length 1, n-1
of length 2, and so on, up to only one factor of length n), while all the possible
subsequences are 2^n (the different ways of choosing sets of positions).
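The notions in this paragraph can be made concrete with a short sketch. It assumes a genome held as a Python string; all function names are hypothetical. It builds the four position sets N(A), N(T), N(C), N(G), rebuilds the genome from them, enumerates the distinct factors to compare their number with the bound n(n+1)/2, and tests whether a word is a subsequence.

```python
def position_sets(genome: str) -> dict[str, set[int]]:
    """Map each symbol to the set of 1-based positions where it occurs."""
    sets: dict[str, set[int]] = {symbol: set() for symbol in "ATCG"}
    for position, symbol in enumerate(genome, start=1):
        sets[symbol].add(position)
    return sets


def from_position_sets(sets: dict[str, set[int]]) -> str:
    """Rebuild the genome, assuming the sets partition the positions 1..n."""
    n = sum(len(positions) for positions in sets.values())
    symbols = [""] * n
    for symbol, positions in sets.items():
        for position in positions:
            symbols[position - 1] = symbol
    return "".join(symbols)


def distinct_factors(genome: str) -> set[str]:
    """All distinct non-empty substrings (brute force, small inputs only)."""
    n = len(genome)
    return {genome[i:j] for i in range(n) for j in range(i + 1, n + 1)}


def is_subsequence(word: str, genome: str) -> bool:
    """True if the symbols of word occur in genome in order, not necessarily contiguously."""
    remaining = iter(genome)
    return all(symbol in remaining for symbol in word)


genome = "ATCGA"
n = len(genome)
print(position_sets(genome))
print(len(distinct_factors(genome)), "distinct factors, bound", n * (n + 1) // 2)
print(is_subsequence("AGA", genome), "AGA" in genome)
```

The bound n(n+1)/2 is reached only when all factors are different. Repeated symbols lower the count, which is why the text says "at most". The last line shows a word that is a subsequence of the genome without being a substring of it.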
A dictionary D of a genome G is a factorization of G when the concatenation of
all the elements of D, possibly with overlapping substrings, yields G (the overlapping
concatenation of αγ with γβ is αγβ). It is intended that in this concatenation
the elements of D may occur at least once, but possibly more than once. We remark
that the problem of genome sequencing can be expressed in the following
way. Given a genomic dictionary D (consisting of words of G, called reads, usually
of average lengths under 1000 bp), find the most probable genome G such
that D is a factorization of G, and where the concatenation of elements is always a
proper overlapping concatenation. Despite this simple formulation, this problem is
computationally complex and its solution is not uniquely defined, in mathematical
terms, but can be found, with a certain probabilistic belief (supported by the
empirical evidence), by means of different and repeated reconstruction experiments of
G from different factorizations of it. Nowadays, many different sequencing methods
are available, which are based on different technologies. Crucial parameters of
these sequencing methods are the average length of reads and the number of
hierarchical phases of string assembling (where fragments of increasing lengths are
reconstructed from factorizations of these fragments).
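The overlapping concatenation can be sketched as follows. This is a simplified illustration, not a sequencing method: it assumes the reads are already given in the right order together with the length of each overlap, which is exactly the information that a real assembly problem lacks. The function names are hypothetical.

```python
def overlap_concat(left: str, right: str, k: int) -> str:
    """Overlapping concatenation: left ends with a word of length k that right starts with.

    Writing left as alpha+gamma and right as gamma+beta, with gamma of length k,
    the result is alpha+gamma+beta.
    """
    if k < 0 or k > min(len(left), len(right)):
        raise ValueError("overlap length out of range")
    if k > 0 and left[-k:] != right[:k]:
        raise ValueError("the two words do not overlap on k symbols")
    return left + right[k:]


def is_factorization(reads: list[str], overlaps: list[int], genome: str) -> bool:
    """Check that chaining the reads with the given overlaps yields the genome.

    overlaps[i] is the overlap between reads[i] and reads[i + 1]; an overlap
    of 0 is plain concatenation.
    """
    if not reads:
        return genome == ""
    if len(overlaps) != len(reads) - 1:
        return False
    assembled = reads[0]
    for read, k in zip(reads[1:], overlaps):
        try:
            assembled = overlap_concat(assembled, read, k)
        except ValueError:
            return False
    return assembled == genome


print(overlap_concat("ATCG", "CGTA", 2))
print(is_factorization(["ATCG", "CGTA", "TAAC"], [2, 2], "ATCGTAAC"))
```

Requiring every overlap to be at least 1 is one way to read the "proper overlapping concatenation" mentioned in the text. That reading is an assumption made here.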