Sequencing and Aligning DNA
Second-generation sequencing is rapidly evolving, with numerous hardware vendors and
new sequencing methods being developed about every six months; however, a common
feature of all these technologies is the use of massively parallel methods, where thousands
or even millions of reactions occur simultaneously. The double-stranded DNA is split down
the middle, the single strands are copied many times, and the copies are randomly shredded
into small fragments of different lengths called reads, which are placed into the sequencer.
The sequencer reads the “letters” in each of these reads, in parallel for high throughput, and
outputs a raw ASCII file containing each read (e.g., AGTTTCGGGATC...), as well as a
quality estimate for each letter read, to be used for downstream analysis.
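
As a rough illustration of what such output can look like, the sketch below assumes the widely used FASTQ layout: four lines per read, with each quality score stored as one ASCII character offset by 33 (a Phred score Q corresponds to an estimated error probability of 10^(-Q/10)). The text above does not name a file format, so the format choice, the sample record, and the function names parse_fastq and error_probability are assumptions made for illustration only.

```python
import io


def parse_fastq(handle):
    """Yield (read_id, bases, qualities) from a four-line-per-record FASTQ stream.

    Assumes Phred+33 quality encoding; this is an assumption, not something
    stated in the text.
    """
    while True:
        header = handle.readline().strip()
        if not header:  # end of file (or a blank line) stops the sketch
            break
        bases = handle.readline().strip()
        handle.readline()  # the '+' separator line carries no data we need
        quality_text = handle.readline().strip()
        # One quality character per base: subtract the ASCII offset of 33.
        qualities = [ord(ch) - 33 for ch in quality_text]
        yield header[1:], bases, qualities


def error_probability(phred_score):
    """Convert a Phred quality score into an estimated per-base error probability."""
    return 10 ** (-phred_score / 10)


# Hypothetical single-read file, held in memory for the example.
sample = io.StringIO("@read1\nAGTTTCGGGATC\n+\nIIIIIIIIII5#\n")
records = list(parse_fastq(sample))

for read_id, bases, qualities in records:
    for base, q in zip(bases, qualities):
        print(read_id, base, q, error_probability(q))
```

In this made-up record the last two letters carry much lower quality scores than the rest, which is the kind of information downstream analysis can use to weight or discard individual base calls.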
A piece of software called an aligner takes each read and works to find its position in the
reference genome (see Figure 23-3). [161] A complete human genome is about 3 billion base
pairs (A, C, T, G) long. [162] The reference genome (e.g., GRCh38) acts like the picture on a
puzzle box, presenting the overall contours and colors of the human genome. Each short
read is like a puzzle piece that needs to be fit into position as closely as possible. A common
metric is “edit distance,” which quantifies the minimum number of single-letter operations
(insertions, deletions, or substitutions) necessary to transform one string into another.
Identical strings have an edit distance of zero, and an indel (an insertion or deletion)
of one letter has an edit distance of one. Since humans are 99.9% identical to one another,
most of the reads will fit to the reference quite well and have a low edit distance. The
challenge with building a good aligner is handling idiosyncratic reads.
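
To make the edit-distance idea concrete, here is a minimal sketch that computes the standard Levenshtein distance and then places a read by brute force, comparing it against every same-length window of a reference string. The function names edit_distance and best_position and the toy sequences are hypothetical. This is not how production aligners work: they rely on indexing the reference rather than scanning it exhaustively, and a fixed-length window is only a crude way to score reads containing indels.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    # previous[j] holds the distance between the first i-1 letters of a
    # and the first j letters of b.
    previous = list(range(len(b) + 1))
    for i, letter_a in enumerate(a, start=1):
        current = [i]  # turning i letters of a into an empty string costs i deletions
        for j, letter_b in enumerate(b, start=1):
            substitution_cost = 0 if letter_a == letter_b else 1
            current.append(min(
                previous[j] + 1,                      # delete letter_a
                current[j - 1] + 1,                   # insert letter_b
                previous[j - 1] + substitution_cost,  # match or substitute
            ))
        previous = current
    return previous[-1]


def best_position(read, reference):
    """Return (start, distance) for the first window with the lowest edit distance."""
    best = None
    for start in range(len(reference) - len(read) + 1):
        window = reference[start:start + len(read)]
        distance = edit_distance(read, window)
        if best is None or distance < best[1]:
            best = (start, distance)
    return best


reference = "AGTTTCGGGATC"  # toy stand-in for a reference genome
print(edit_distance("AGTT", "AGTT"))       # identical strings: 0
print(edit_distance("AGTT", "AGT"))        # one-letter indel: 1
print(best_position("GGGATC", reference))  # exact match: (6, 0)
print(best_position("GGCATC", reference))  # one substitution: (6, 1)
```

The second call to best_position shows the common case described above: a read that differs slightly from the reference still lands in the right place with a low edit distance.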