ADAM, A Scalable Genome Analysis Platform
Aligning the reads to a reference genome is only the first of a series of steps necessary to
generate reports that are useful in a clinical or research setting. The early stages of this
processing pipeline look similar to any other extract-transform-load (ETL) pipeline that
needs data deduplication and normalization before analysis.
The sequencing process duplicates genomic DNA, so the same DNA reads may be generated
multiple times; these duplicates need to be marked. The sequencer also provides a quality
estimate for each DNA “letter” that it reads, and these estimates have sequencer-specific
biases that need to be adjusted. Aligners often misplace reads that have indels (inserted
or deleted sequences), and these reads need to be repositioned on the reference genome.
Currently, this preprocessing is done using single-purpose tools launched by shell scripts
on a single machine. These tools take multiple days to finish processing whole genomes.
The process is disk bound, with each stage writing a new file to be read into subsequent
stages, and is an ideal use case for applying general-purpose big data technology. ADAM is
able to handle the same preprocessing in under two hours.
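The duplicate-marking step described above can be illustrated with a minimal Python sketch.
It is a simplification under stated assumptions, not how ADAM or any production tool does
it: the Read class, its fields, and the rule "keep the read with the highest summed base
quality at each (contig, start, strand)" are all chosen here for illustration, and real
tools also account for read pairs and clipped bases.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Read:
    # Hypothetical, simplified read record (not ADAM's schema).
    name: str
    contig: str        # reference sequence the read aligned to
    start: int         # 1-based alignment start
    reverse: bool      # True if aligned to the reverse strand
    quals: list        # per-base Phred quality scores
    duplicate: bool = False


def mark_duplicates(reads):
    """Flag all but the highest-quality read at each (contig, start, strand)."""
    groups = defaultdict(list)
    for read in reads:
        groups[(read.contig, read.start, read.reverse)].append(read)
    for group in groups.values():
        # Keep the read with the largest summed base quality;
        # on a tie, the first read seen is kept.
        best = max(group, key=lambda r: sum(r.quals))
        for read in group:
            read.duplicate = read is not best
    return reads


reads = [
    Read("r1", "chr1", 100, False, [30, 30, 30]),
    Read("r2", "chr1", 100, False, [35, 36, 37]),
    Read("r3", "chr1", 100, True, [20, 20, 20]),
    Read("r4", "chr2", 100, False, [10, 10, 10]),
]
mark_duplicates(reads)
duplicate_names = [r.name for r in reads if r.duplicate]
print(duplicate_names)
```

Here r1 and r2 share a position and strand, so the lower-quality r1 is flagged; r3 and r4
are left alone. Grouping by key and reducing each group is the kind of operation that maps
naturally onto a distributed engine such as Spark.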
ADAM is a genome analysis platform that focuses on rapidly processing petabytes of
high-coverage, whole-genome data. ADAM relies on Apache Avro, Parquet, and Spark. Used
together, these systems provide many benefits, since they:
▪ Allow developers to focus on algorithms without needing to worry about
distributed system failures
▪ Enable jobs to be run locally on a single machine, on an in-house cluster, or in the
cloud without changing code
▪ Compress legacy genomic formats and provide predicate pushdown and projection
for performance
▪ Provide an agile way of customizing and evolving data formats
▪ Are designed to easily scale out using only commodity hardware
▪ Are shared with a standard Apache 2.0 license [ 163 ]
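To make "predicate pushdown and projection" concrete, here is a toy, pure-Python sketch of
the idea. It does not use Parquet or its API; the row-group layout, the make_row_group and
scan functions, and the column names are all hypothetical. The assumption it models is that
a columnar file keeps per-column minimum/maximum statistics for each group of rows, so a
reader can skip whole groups that cannot match a filter and can read only the columns a
query asks for.

```python
def make_row_group(rows):
    """Store each column as its own list, plus min/max statistics for 'start'."""
    columns = {name: [row[name] for row in rows] for name in rows[0]}
    stats = {"start": (min(columns["start"]), max(columns["start"]))}
    return {"columns": columns, "stats": stats}


def scan(row_groups, project, start_min, start_max):
    """Return the projected columns for rows with start_min <= start <= start_max."""
    results, groups_read = [], 0
    for group in row_groups:
        lo, hi = group["stats"]["start"]
        # Predicate pushdown: skip the whole group if its statistics rule it out.
        if hi < start_min or lo > start_max:
            continue
        groups_read += 1
        cols = group["columns"]
        for i, start in enumerate(cols["start"]):
            if start_min <= start <= start_max:
                # Projection: touch only the requested columns.
                results.append({name: cols[name][i] for name in project})
    return results, groups_read


rows = [
    {"name": "r1", "start": 100, "sequence": "ACGT", "mapq": 60},
    {"name": "r2", "start": 150, "sequence": "TTGA", "mapq": 42},
    {"name": "r3", "start": 5000, "sequence": "GGCA", "mapq": 30},
    {"name": "r4", "start": 5200, "sequence": "CATG", "mapq": 55},
]
row_groups = [make_row_group(rows[:2]), make_row_group(rows[2:])]

hits, groups_read = scan(row_groups, ["name", "mapq"], 5000, 5100)
print(hits, groups_read)
```

Only the second row group is read, and the bulky sequence column is never touched. For
queries that need a small genomic region and a few fields, this is where much of the
performance benefit would be expected to come from.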
Literate programming with the Avro interface description language (IDL)
The Sequence Alignment/Map (SAM) specification defines the mandatory fields listed in
Table 23-2.
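As a rough illustration of what those mandatory fields look like in practice, the following
Python sketch parses one tab-separated SAM alignment line into a typed record. The 11 field
names and their order follow the SAM specification; the SamRecord class, the parse_sam_line
function, and the sample line are hypothetical and are not part of ADAM or of any SAM
library.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # query (read) name
    flag: int    # bitwise flags
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string describing the alignment
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities (Phred+33)


def parse_sam_line(line):
    """Parse the 11 mandatory fields; optional tags after them are ignored."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    q, f, r, p, m, c, rn, pn, t, s, ql = fields[:11]
    return SamRecord(q, int(f), r, int(p), int(m), c, rn, int(pn), int(t), s, ql)


# Made-up example line for illustration only.
record = parse_sam_line("read1\t0\tchr1\t100\t60\t4M\t*\t0\t0\tACGT\tIIII\n")
print(record.rname, record.pos, record.cigar)
```

A hand-written parser like this ties the data's shape to one piece of code; describing the
same fields once in a schema language is the alternative.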