Database Reference
In-Depth Information
Figure 23-1. Timeline of big data technology and cost of sequencing a genome
The plummeting cost of sequencing points to superlinear growth of genomics data over
the coming years. This DNA data deluge has left biological data scientists struggling to
process data in a timely and scalable way using current genomics software. The AMPLab
is a research lab in the Computer Science Division at UC Berkeley focused on creating
novel big data systems and applications. For example, Apache Spark (see Chapter 19 ) is
one system that grew out of the AMPLab. Spark recently broke the world record for the
Daytona Gray Sort, sorting 100 TB in just 23 minutes. The team at Databricks that broke
the record also demonstrated they could sort 1 PB in less than 4 hours!
Consider this amazing possibility: we have technology today that could analyze every
genome collected in 2014 on the order of days using a few hundred machines.
While the AMPLab identified genomics as the ideal big data application for technical
reasons, there are also more important compassionate reasons: the timely processing of
Search WWH ::




Custom Search