TGTAATCCCAGCACTTTGGGAG, 31206
CTGTAATCCCAGCACTTTGGGA, 30809
GCCTCCCAAAGTGCTGGGATTA, 30716
...
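The listing above pairs each k-mer with the number of times it occurs. As a rough illustration of what such a count involves, here is a minimal single-machine sketch in plain Python. It is not ADAM's implementation (ADAM distributes the work with Spark); the function name `count_kmers` and the toy reads are assumptions made up for this example.

```python
from collections import Counter

def count_kmers(reads, k):
    """Count every length-k substring across an iterable of read sequences.

    Hypothetical helper for illustration only; not part of ADAM.
    """
    counts = Counter()
    for seq in reads:
        # Slide a window of width k along the read; reads shorter than k
        # produce an empty range and contribute nothing.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# Toy reads, invented for the example (not taken from a real dataset).
toy_reads = ["ACGTACGT", "CGTACG"]
for kmer, n in count_kmers(toy_reads, 4).most_common(3):
    print(f"{kmer}, {n}")
```

Sorting by count, as `most_common` does here, gives the same "most frequent first" shape as the listing.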
ADAM can do much more than just count k-mers. Aside from the preprocessing stages
already mentioned (duplicate marking, base quality score recalibration, and indel
realignment), it also:
▪ Calculates coverage (read depth) at each variant in a Variant Call Format (VCF) file
▪ Counts the k-mers/q-mers from a read dataset
▪ Loads gene annotations from a Gene Transfer Format (GTF) file and outputs the
corresponding gene models
▪ Prints statistics on all the reads in a read dataset (e.g., percentage mapped to the
reference, number of duplicates, and reads mapped cross-chromosome)
▪ Launches legacy variant callers, pipes reads into stdin, and saves output from
stdout
▪ Comes with a basic genome browser to view reads in a web browser
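To make the read-statistics item in the list above concrete, the following sketch tallies the three example figures it names from SAM-style flag bits. The bit values (0x1 paired, 0x4 unmapped, 0x8 mate unmapped, 0x400 duplicate) come from the SAM format specification; everything else is an assumption for illustration: the function `summarize_reads`, the dictionary keys `flag`, `contig`, and `mate_contig`, and the toy records. It is not ADAM's own code or output format.

```python
# Flag bits as defined by the SAM format specification.
FLAG_PAIRED = 0x1
FLAG_UNMAPPED = 0x4
FLAG_MATE_UNMAPPED = 0x8
FLAG_DUPLICATE = 0x400

def summarize_reads(reads):
    """Tally simple statistics over read records.

    Each record is a dict with hypothetical keys "flag", "contig",
    and "mate_contig" (names chosen for this sketch only).
    """
    total = mapped = duplicates = cross_chromosome = 0
    for read in reads:
        flag = read["flag"]
        total += 1
        is_mapped = not (flag & FLAG_UNMAPPED)
        if is_mapped:
            mapped += 1
        if flag & FLAG_DUPLICATE:
            duplicates += 1
        # A pair is cross-chromosome when both ends are mapped
        # but to different contigs.
        mate_mapped = not (flag & FLAG_MATE_UNMAPPED)
        if (flag & FLAG_PAIRED) and is_mapped and mate_mapped \
                and read["contig"] != read["mate_contig"]:
            cross_chromosome += 1
    percent_mapped = 100.0 * mapped / total if total else 0.0
    return {
        "total": total,
        "percent_mapped": percent_mapped,
        "duplicates": duplicates,
        "cross_chromosome": cross_chromosome,
    }

# Toy records, invented for the example.
toy_reads = [
    {"flag": 0x1, "contig": "chr1", "mate_contig": "chr1"},
    {"flag": 0x1, "contig": "chr1", "mate_contig": "chr2"},
    {"flag": 0x1 | 0x400, "contig": "chr2", "mate_contig": "chr2"},
    {"flag": 0x4, "contig": None, "mate_contig": None},
]
print(summarize_reads(toy_reads))
```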
However, the most important thing ADAM provides is an open, scalable platform. All
artifacts are published to Maven Central (search for group ID org.bdgenomics) to make
it easy for developers to benefit from the foundation ADAM provides. ADAM data is
stored in Avro and Parquet, so you can also use systems like Spark SQL, Impala, Apache
Pig, Apache Hive, or others to analyze the data. ADAM also supports jobs written in Scala,
Java, and Python, with more language support on the way.
At Scala.IO in Paris in 2014, Andy Petrella and Xavier Tordoir used Spark's MLlib
k-means with ADAM for population stratification across the 1000 Genomes dataset
(population stratification is the process of assigning an individual genome to an
ancestral group). They found that ADAM/Spark improved performance by a factor of 150.
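For readers unfamiliar with k-means, the sketch below shows the basic assign-and-update loop on a handful of toy genotype vectors. It is a plain-Python illustration of the general algorithm, not Spark's MLlib and not the pipeline used in the talk. The encoding (one number per variant site, counting alternate alleles as 0, 1, or 2), the deterministic seeding, and the names `kmeans` and `sq_dist` are all assumptions made for this example.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=10):
    """Basic k-means (Lloyd's algorithm) with deterministic seeding.

    Seeds are taken at evenly spaced positions in the input so the
    example is reproducible; real implementations usually seed randomly.
    """
    centroids = [list(points[i * len(points) // k]) for i in range(k)]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids

# Toy genotype vectors, invented for the example: one row per individual,
# one column per variant site, values are alternate-allele counts.
genotypes = [
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [2, 2, 1, 2],
    [2, 1, 2, 2],
    [2, 2, 2, 2],
]
labels, centroids = kmeans(genotypes, k=2)
print(labels)
```

Each label is the index of the cluster an individual was assigned to; in a stratification setting those clusters would be interpreted as candidate ancestral groups.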