Biological Data Science: Saving Lives with Software - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

ADAM, A Scalable Genome Analysis Platform

Aligning the reads to a reference genome is only the first of a series of steps necessary to

generate reports that are useful in a clinical or research setting. The early stages of this pro-

cessing pipeline look similar to any other extract-transform-load (ETL) pipelines that need

data deduplication and normalization before analysis.

The sequencing process duplicates genomic DNA, so it's possible that the same DNA reads

are generated multiple times; these duplicates need to be marked. The sequencer also

provides a quality estimate for each DNA “letter” that it reads, which has sequencer-specif-

ic biases that need to be adjusted. Aligners often misplace reads that have indels (inserted

or deleted sequences) that need to be repositioned on the reference genome. Currently, this

preprocessing is done using single-purpose tools launched by shell scripts on a single ma-

chine. These tools take multiple days to finish the processing of whole genomes. The pro-

cess is disk bound, with each stage writing a new file to be read into subsequent stages, and

is an ideal use case for applying general-purpose big data technology. ADAM is able to

handle the same preprocessing in under two hours.

ADAM is a genome analysis platform that focuses on rapidly processing petabytes of high-

coverage, whole genome data. ADAM relies on Apache Avro, Parquet, and Spark. These

systems provide many benefits when used together, since they:

▪ Allow developers to focus on algorithms without needing to worry about distrib-

uted system failures

▪ Enable jobs to be run locally on a single machine, on an in-house cluster, or in the

cloud without changing code

▪ Compress legacy genomic formats and provide predicate pushdown and projection

for performance

▪ Provide an agile way of customizing and evolving data formats

▪ Are designed to easily scale out using only commodity hardware

▪ Are shared with a standard Apache 2.0 license [ 163 ]

Literate programming with the Avro interface description

language (IDL)

The Sequence Alignment/Map (SAM) specification defines the mandatory fields listed in

Table 23-2 .

Search WWH ::

Custom Search

Home