Biological Data Science: Saving Lives with Software - Hadoop: The Definitive Guide

Database Reference

In-Depth Information

Chapter 23. Biological Data Science:

Saving Lives with Software

Matt Massie

It's hard to believe a decade has passed since the MapReduce paper appeared at OSDI'04.

It's also hard to overstate the impact that paper had on the tech industry; the MapReduce

paradigm opened distributed programming to nonexperts and enabled large-scale data pro-

cessing on clusters built using commodity hardware. The open source community respon-

ded by creating open source MapReduce-based systems, like Apache Hadoop and Spark,

that enabled data scientists and engineers to formulate and solve problems at a scale unima-

gined before.

While the tech industry was being transformed by MapReduce-based systems, biology was

experiencing its own metamorphosis driven by second-generation (or “next-generation”)

sequencing technology; see Figure 23-1 . Sequencing machines are scientific instruments

that read the chemical “letters” (A, C, T, and G) that make up your genome: your complete

set of genetic material. To have your genome sequenced when the MapReduce paper was

published cost about $20 million and took many months to complete; today, it costs just a

few thousand dollars and takes only a few days. While the first human genome took dec-

ades to create, in 2014 alone an estimated 228,000 genomes were sequenced world-

wide. [ 152 ] This estimate implies around 20 petabytes (PB) of sequencing data were gener-

ated in 2014 worldwide.

Search WWH ::

Custom Search

Home