sequencing all short and mate-paired libraries, we generated
39 Gb of
usable paired-end sequence data, from which adaptors were trimmed and
low-quality sequences and read-duplicates removed. 11
Because both template heterozygosity and coverage affect an
assembly, 16 we estimated the heterozygosity in the A. suum dataset by
calculating the frequency of occurrence of each 17 bp k-mer (i.e. each
unique combination of 17 nucleotides occurring in the dataset) within the
genomic sequence dataset (from the 170 bp library) and, based on this
calculation, estimated 17 the genome size to be ~300 Mb, suggesting
a mean coverage of the genome of ~80-fold, which we deemed to be more
than adequate for assembly.
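The k-mer calculation described above can be illustrated with a minimal sketch in Python. It assumes the commonly used relation in which genome size is approximated by the total number of k-mers divided by the depth at the main peak of the k-mer frequency distribution. The text does not state the exact formula applied to the A. suum data, so this relation, the function names (count_kmers, estimate_genome_size) and the min_depth cut-off are illustrative assumptions, not the authors' implementation.

```python
from collections import Counter


def count_kmers(reads, k=17):
    """Count every k-mer (substring of length k) across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def estimate_genome_size(kmer_counts, min_depth=2):
    """Return (estimated genome size, peak k-mer depth).

    Assumption: size ~= total k-mers / peak depth. Depths below
    min_depth are ignored when locating the peak, because k-mers
    seen only once are often sequencing errors (hypothetical cut-off).
    """
    # Histogram: depth -> number of distinct k-mers seen at that depth.
    histogram = Counter(kmer_counts.values())
    candidates = {d: n for d, n in histogram.items() if d >= min_depth}
    if not candidates:
        raise ValueError("no k-mers at or above min_depth")
    peak_depth = max(candidates, key=candidates.get)
    total_kmers = sum(kmer_counts.values())
    return total_kmers / peak_depth, peak_depth
```

In practice, dedicated k-mer counters are used for datasets of this size; the sketch only shows the arithmetic. Heterozygosity typically appears in such a histogram as a secondary peak at roughly half the main depth, which a single-peak estimate like this one does not model.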
Genome Assembly
Although short-read platforms provide unprecedented capacity to
sequence large genomes in a cost- and time-effective manner, the
assembly of short-read data presents significant technical challenges.
Therefore, a wide array of advanced and complex mathematical algorithms
has been developed for the assembly process. The principles of
these mathematical approaches have been reviewed thoroughly 18 and are,
thus, not repeated here. For the assembly of the genome of A. suum, we
used the program SOAPdenovo 16 to join overlapping, single-end read
data from the short-insert library datasets into contigs using a de Bruijn
graph approach, 18 and then connected, through an iterative process,
contigs into scaffolds using paired-end data from the large-insert mate-
paired libraries. 11 Between each iteration of the scaffold assembly phase,
we conducted local assemblies in gap regions between the contigs using
sequence data from the short-insert libraries. Due to the technical
constraints of the de Bruijn graph assembly algorithm, this scaffolding
approach requires that a heavy “weighting” be applied in favor of data
generated from the mate-paired libraries. However, because the WGA
process has the potential to introduce substitution errors, we needed to
account for this weighting by remapping all raw reads to the final
assembly using the program Maq 19 to produce a final, unbiased consensus
sequence. This approach yielded a high-quality assembly of ~273 Mb
represented by ~1600 contigs of >2 kb, with an N50 of 408 kb (50% of all
nucleotides in the assembly are in contigs of at least 408 kb in length)
and an N90 of 80 kb (90% of all nucleotides in the assembly are in contigs
of at least 80 kb in length) (Table 11.1).
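The N50 and N90 statistics quoted above follow directly from a list of sequence lengths. The sketch below, in Python, shows one way to compute them. The function name nx_length and the toy lengths in the usage lines are hypothetical and are not taken from the A. suum assembly.

```python
def nx_length(lengths, percent):
    """Return the Nx length for a list of sequence lengths.

    The Nx length is the largest length L such that sequences of
    length >= L together contain at least `percent` % of all bases
    (percent=50 gives N50, percent=90 gives N90).
    """
    total = sum(lengths)
    running = 0
    # Walk from the longest sequence down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        # Integer comparison avoids floating-point rounding issues.
        if running * 100 >= percent * total:
            return length
    raise ValueError("no sequence lengths supplied")


# Toy example (invented lengths, in kb).
toy_lengths = [80, 70, 50, 40, 30, 20, 10]
n50 = nx_length(toy_lengths, 50)
n90 = nx_length(toy_lengths, 90)
```

Because N90 requires a larger share of the assembly to be covered, it can never exceed N50 for the same set of lengths, which is consistent with the 408 kb and 80 kb values reported here.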
Gene Prediction and Annotation
Following the assembly, a multi-step process is required to predict the
coding regions of the genome and their function based on comparisons