Database Reference
In-Depth Information
Chapter 23. Biological Data Science:
Saving Lives with Software
Matt Massie
It's hard to believe a decade has passed since the MapReduce paper appeared at OSDI'04.
It's also hard to overstate the impact that paper had on the tech industry; the MapReduce
paradigm opened distributed programming to nonexperts and enabled large-scale data pro-
cessing on clusters built using commodity hardware. The open source community respon-
ded by creating open source MapReduce-based systems, like Apache Hadoop and Spark,
that enabled data scientists and engineers to formulate and solve problems at a scale unima-
gined before.
While the tech industry was being transformed by MapReduce-based systems, biology was
experiencing its own metamorphosis driven by second-generation (or “next-generation”)
sequencing technology; see Figure 23-1 . Sequencing machines are scientific instruments
that read the chemical “letters” (A, C, T, and G) that make up your genome: your complete
set of genetic material. To have your genome sequenced when the MapReduce paper was
published cost about $20 million and took many months to complete; today, it costs just a
few thousand dollars and takes only a few days. While the first human genome took dec-
ades to create, in 2014 alone an estimated 228,000 genomes were sequenced world-
wide. [ 152 ] This estimate implies around 20 petabytes (PB) of sequencing data were gener-
ated in 2014 worldwide.
Search WWH ::




Custom Search