Biological Data Science: Saving Lives with Software - Hadoop: The Definitive Guide

TGTAATCCCAGCACTTTGGGAG, 31206

CTGTAATCCCAGCACTTTGGGA, 30809

GCCTCCCAAAGTGCTGGGATTA, 30716

...
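The listing above pairs each k-mer with the number of times it occurs. As a rough illustration of what that count means, here is a minimal single-machine sketch in plain Python. It is not ADAM's API and does not use Spark; the function name `count_kmers` and the toy reads are made up for this example.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every length-k substring (k-mer) across an iterable of read sequences."""
    counts = Counter()
    for read in reads:
        # Slide a window of width k along the read; reads shorter than k add nothing.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


# Toy reads, invented for illustration only (not from any real dataset).
reads = ["ACGTACGT", "ACGTTT"]
kmer_counts = count_kmers(reads, k=4)

# Print in the same "k-mer, count" layout as the listing, most frequent first.
for kmer, n in kmer_counts.most_common(1):
    print(f"{kmer}, {n}")
```

With these toy reads, ACGT occurs three times and every other 4-mer once. A distributed version would perform the same windowing per read and then sum counts per k-mer across the cluster.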

ADAM can do much more than just count k-mers. Aside from the preprocessing stages already mentioned (duplicate marking, base quality score recalibration, and indel realignment), it also:

▪ Calculates coverage (read depth) at each variant in a Variant Call Format (VCF) file

▪ Counts the k-mers/q-mers from a read dataset

▪ Loads gene annotations from a Gene Transfer Format (GTF) file and outputs the

corresponding gene models

▪ Prints statistics on all the reads in a read dataset (e.g., percentage mapped to the reference, number of duplicates, and reads mapped cross-chromosome)

▪ Launches legacy variant callers, pipes reads into stdin, and saves output from

stdout

▪ Comes with a basic genome browser to view reads in a web browser
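To make the read-statistics item concrete, the following is a small sketch that tallies the same kinds of numbers from SAM-formatted alignment lines in plain Python. It is not ADAM's implementation; the function name `read_stats` and the four toy records are assumptions for illustration. The flag bits (0x4 for unmapped, 0x400 for duplicate) and the column layout come from the SAM format.

```python
# SAM flag bits (from the SAM format): 0x4 = read unmapped, 0x400 = PCR/optical duplicate.
FLAG_UNMAPPED = 0x4
FLAG_DUPLICATE = 0x400


def read_stats(sam_lines):
    """Tally simple statistics from SAM alignment lines; '@' header lines are skipped."""
    total = mapped = duplicates = cross_chrom = 0
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Columns: QNAME, FLAG, RNAME, POS, MAPQ, CIGAR, RNEXT, ...
        flag, rname, rnext = int(fields[1]), fields[2], fields[6]
        total += 1
        is_mapped = not (flag & FLAG_UNMAPPED)
        if is_mapped:
            mapped += 1
        if flag & FLAG_DUPLICATE:
            duplicates += 1
        # RNEXT is "=" when the mate is on the same reference and "*" when unknown.
        if is_mapped and rnext not in ("=", "*") and rnext != rname:
            cross_chrom += 1
    pct_mapped = 100.0 * mapped / total if total else 0.0
    return {
        "total": total,
        "pct_mapped": pct_mapped,
        "duplicates": duplicates,
        "cross_chromosome": cross_chrom,
    }


# Toy records, invented for illustration only.
toy_sam = [
    "@HD\tVN:1.6",
    "r1\t99\tchr1\t100\t60\t8M\t=\t200\t108\tACGTACGT\tIIIIIIII",
    "r2\t1089\tchr1\t150\t60\t8M\tchr2\t500\t0\tACGTACGT\tIIIIIIII",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\tACGTACGT\tIIIIIIII",
    "r4\t0\tchr2\t300\t60\t8M\t*\t0\t0\tACGTACGT\tIIIIIIII",
]
stats = read_stats(toy_sam)
print(stats)
```

On a real dataset the same per-read checks would run in parallel over partitions of reads, with the counters summed at the end.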

However, the most important thing ADAM provides is an open, scalable platform. All artifacts are published to Maven Central (search for group ID org.bdgenomics) to make it easy for developers to benefit from the foundation ADAM provides. ADAM data is stored in Avro and Parquet, so you can also use systems like SparkSQL, Impala, Apache Pig, Apache Hive, or others to analyze the data. ADAM also supports jobs written in Scala, Java, and Python, with more language support on the way.

At Scala.IO in Paris in 2014, Andy Petrella and Xavier Tordoir used Spark's MLlib k-means with ADAM for population stratification across the 1000 Genomes dataset (population stratification is the process of assigning an individual genome to an ancestral group). They found that ADAM/Spark improved performance by a factor of 150.
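To give a feel for how k-means can assign genomes to groups, here is a toy sketch of Lloyd's algorithm in plain Python. It is not the code from the Scala.IO talk and does not use Spark's MLlib. The genotype vectors (alternate-allele counts of 0, 1, or 2 at four hypothetical variant sites), the choice of two clusters, and the starting centroids are all assumptions made for illustration.

```python
def kmeans(points, centroids, iterations=10):
    """Lloyd's algorithm: alternate nearest-centroid assignment and centroid update."""
    centroids = [list(c) for c in centroids]
    labels = []
    for _ in range(iterations):
        # Assignment step: each point joins the cluster with the nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(
                range(len(centroids)),
                key=lambda j: sum((x - c) ** 2 for x, c in zip(p, centroids[j])),
            )
            for p in points
        ]
        # Update step: move each centroid to the mean of its current members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Toy genotype vectors, invented for illustration: one row per individual,
# one column per variant site, values are alternate-allele counts.
genotypes = [
    [0, 0, 2, 2],
    [0, 1, 2, 2],
    [0, 0, 2, 1],
    [2, 2, 0, 0],
    [2, 1, 0, 0],
    [2, 2, 1, 0],
]

# Seed the two clusters with the first and fourth individuals (an arbitrary choice).
labels, centroids = kmeans(genotypes, [genotypes[0], genotypes[3]])
print(labels)
```

The first three individuals end up in one cluster and the last three in the other. A real analysis would use far more variant sites and individuals, which is where a distributed implementation becomes useful.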
