general purpose genomic resource has been instrumental in key recent find-
ings of X. tropicalis biology ( Bowes et al., 2008; Hellsten et al., 2010 ).
Unlike the mouse and human genomes, the X. tropicalis genome project did not
benefit from the input and curation of a large scientific community. Indeed,
this is the case for almost all eukaryotic genomes published so far, except
for a few model organisms. As a result, the draft genome assembly v4.1 of
X. tropicalis is composed of 19,501 scaffolds (see
Fig. 10.1 A; Buisine & Sachs, 2009 ) representing unordered fragments of
the 10 chromosomes ( Tymowska & Fischberg, 1973 ). In version 4.1, the
largest scaffold is 7.8 Mb long and the N50 is 1.6 Mb. Version 7.1 is
considerably less fragmented, with only 7730 scaffolds, an N50 of 130 Mb,
and a longest scaffold of 216 Mb. Interestingly, a small number
of the largest scaffolds capture the majority of the genes: 80% of the Ensembl
genes lie in the 834 largest scaffolds ( Fig. 10.1 A). In addition, many assembly gaps
(stretches of “N”s corresponding to unsequenced regions) are found in
scaffolds and add an extra level of fragmentation ( Fig. 10.1 B and C). Given
that many multiexon genes are several tens of kilobases long, such gaps will
frequently fall within gene loci, altering apparent distances ( Fig. 10.1 B)
and producing missing exons or gene fragments. Thus, although most of the
genome sequence is captured by a small subset of
scaffolds (80% of the assembly in the top 2000 scaffolds), fragmentation
remains high ( Fig. 10.1 C). BAC, cosmid and fosmid libraries have been
end-sequenced to resolve physical continuity over a large size range ( Hellsten
et al., 2010 ). Nonetheless, end-sequenced scaffolds are still present in the
assembly, artificially increasing the size of the assembled genome by
approximately 85 Mbp. They cluster as dense clumps of points on Fig. 10.1 A
(highlighted by the three leftmost ellipses). In addition, some regions
(end-sequences) may be represented multiple times in the genome sequence.
While the assembled genome is 1.5 Gbp long, one has to take into account
that, overall, 20% of the released assembly is actually not sequenced. These
unassembled regions result from low-complexity regions or repeated sequences,
which make both base detection (in the case of long stretches of similar
nucleotides) and assembly (because of the ambiguity introduced) more
difficult. In addition, assembly gaps could also be due to the presence of
allelic polymorphism, which could induce hard-to-resolve ambiguities during
the assembly process. Intraspecies polymorphism can be ruled out in the case
of X. tropicalis since only one inbred individual frog was used for genome
sequencing ( Hellsten et al., 2010 ).
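
The fragmentation measures discussed above (N50, the number of top scaffolds needed to cover a given share of the assembly, and the share of "N" positions) can be computed from scaffold lengths and sequences alone. The following Python code is a minimal sketch, not the procedure used by the authors. The function names (n50, scaffolds_needed, gap_runs, gap_fraction) and the toy data are hypothetical, and it assumes that scaffold sequences have already been loaded as plain strings.

```python
import re
from typing import Iterable, List, Tuple


def n50(lengths: Iterable[int]) -> int:
    """Length L such that scaffolds of length >= L hold at least half the assembly."""
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half:
            return length
    return 0  # empty assembly


def scaffolds_needed(lengths: Iterable[int], fraction: float) -> int:
    """Number of largest scaffolds needed to cover `fraction` of the total length."""
    ordered = sorted(lengths, reverse=True)
    target = fraction * sum(ordered)
    running = 0
    for count, length in enumerate(ordered, start=1):
        running += length
        if running >= target:
            return count
    return len(ordered)


def gap_runs(sequence: str, min_len: int = 1) -> List[Tuple[int, int]]:
    """Half-open (start, end) coordinates of runs of N (assembly gaps) in a scaffold."""
    pattern = r"[Nn]{%d,}" % min_len
    return [match.span() for match in re.finditer(pattern, sequence)]


def gap_fraction(sequences: Iterable[str]) -> float:
    """Share of assembly positions that are N, i.e. not actually sequenced."""
    total = 0
    gaps = 0
    for seq in sequences:
        total += len(seq)
        gaps += sum(end - start for start, end in gap_runs(seq))
    return gaps / total if total else 0.0


# Toy data (hypothetical, for illustration only)
toy_lengths = [40, 30, 20, 10]
toy_scaffolds = ["ACGTNNNNACGNNA", "ACGTAC"]

print(n50(toy_lengths))                    # 30
print(scaffolds_needed(toy_lengths, 0.8))  # 3
print(gap_runs(toy_scaffolds[0]))          # [(4, 8), (11, 13)]
print(gap_fraction(toy_scaffolds))         # 0.3
```

On a real assembly, the same functions would be applied to the lengths and sequences of all scaffolds. A gap_fraction close to 0.2 would correspond to the 20% of unsequenced positions mentioned in the text.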