Database Reference
In-Depth Information
Join In
While we're off to a great start, the ADAM project is still an experimental platform and
needs further development. If you're interested in learning more about programming on
ADAM or want to contribute code, take a look at
Advanced Analytics with Spark: Patterns
for Learning from Data at Scale
by Sandy Ryza et al. (O'Reilly, 2014), which includes a
chapter on analyzing genomics data with ADAM and Spark. You can find us at
ht-
tp://bdgenomics.org
,
on IRC at #adamdev, or on Twitter at
@bigdatagenomics
.
[
152
]
See Antonio Regalado,
“EmTech: Illumina Says 228,000 Human Genomes Will Be Sequenced This
Year,”
September 24, 2014.
[
153
]
This process is called
protein folding
. The
Folding@home
allows volunteers to donate CPU cycles to
help researchers determine the mechanisms of protein folding.
[
154
]
There are also a few nonstandard amino acids not shown in the table that are encoded differently.
[
155
]
See Eva Bianconi et al.,
“An estimation of the number of cells in the human body,”
Annals of Human
Biology
, November/December 2013.
[
156
]
You might expect this to be 6.4 billion letters, but the reference genome is, for better or worse, a
haploid
representation of the average of dozens of individuals.
[
157
]
Only about 28% of your DNA is transcribed into nascent RNA, and after RNA splicing, only about
1.5% of the RNA is left to code for proteins. Evolutionary selection occurs at the DNA level, with most of
your DNA providing support to the other 0.5% or being deselected altogether (as more fitting DNA evolves).
There are some cancers that appear to be caused by dormant regions of DNA being resurrected, so to speak.
[
158
]
There is actually, on average, about 1 error for each billion DNA “letters” copied. So, each cell isn't
ex-
actly
the same.
[
159
]
Intentionally 50 years after Watson and Crick's discovery of the 3D structure of DNA.
[
160
]
Jonathan Max Gitlin,
“Calculating the economic impact of the Human Genome Project,”
June 2013.
[
161
]
There is also a second approach, de novo assembly, where reads are put into a graph data structure to
create long sequences without mapping to a reference genome.
[
162
]
Each base is about 3.4 angstroms, so the DNA from a single human cell stretches over 2 meters end to
end!
[
163
]
Unfortunately, some of the more popular software in genomics has an ill-defined or custom, restrictive
license. Clean open source licensing and source code are necessary for science to make it easier to reproduce
and understand results.
[
164
]
BAM is the compressed binary version of the SAM format.