FIGURE 26.1 Big data are all around us, enabled by technological advances in micro- and nano-electronics, nanomaterials, the interconnectivity provided by sophisticated telecommunication infrastructure, massive network-attached storage capabilities, and commodity-based high-performance computing infrastructures. We can now store all credit card transactions, all cell phone traffic, all email traffic, video from extensive networks of surveillance devices, and satellite and ground sensing data informing on all aspects of the weather and overall climate, and we can generate and store massive data informing on our personal health, including whole-genome sequencing data and extensive imaging data. This ability is driving a revolution in high-end data analytics to make sense of the big data and to build more accurate descriptive and predictive models that inform decision making on every level, whether identifying the next big security threat or making the best diagnosis and treatment choice for a given patient.
thousands of protein-coding and non-coding genes simultaneously [3,4], score hundreds of thousands of SNPs (single nucleotide polymorphisms) in individual samples [5], sequence entire human genomes now for less than $5000 [6], and relate all of these data patterns to a great diversity of other biologically relevant information (clinical data, biochemical data, social networking data, etc.). Given technologies on the horizon such as the IBM DNA transistor, with theoretical sequencing limits in the hundreds of millions of bases per second per transistor (imagine millions of these transistors packed together in a single handheld device) [7], in the future we will not be talking about Google rolling through neighborhoods with Wi-Fi-sniffing equipment [8], but rather about DNA-sniffing equipment rolling through neighborhoods, sequencing everything it encounters in real time and then pumping those data into big data clouds to link them with all other available information in the digital universe.
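Taken at face value, those throughput figures mean raw base calling would effectively cease to be the bottleneck. A back-of-envelope sketch, using assumed round numbers from the text (1e8 bases per second per transistor, 1e6 transistors per device, a 3-gigabase human genome), makes the point:

```python
# Back-of-envelope arithmetic; all figures are assumptions taken from the
# text, not measured device specifications.
BASES_PER_SEC_PER_TRANSISTOR = 1e8   # "hundreds of millions" -> low end
TRANSISTORS_PER_DEVICE = 1e6         # "millions packed into a handheld device"
HUMAN_GENOME_BASES = 3e9             # haploid human genome, ~3 billion bases

device_throughput = BASES_PER_SEC_PER_TRANSISTOR * TRANSISTORS_PER_DEVICE
seconds_per_genome = HUMAN_GENOME_BASES / device_throughput

print(f"Device throughput: {device_throughput:.1e} bases/sec")          # 1.0e+14
print(f"One genome at 1x coverage: {seconds_per_genome * 1e6:.0f} us")  # ~30 microseconds
```

At such limits the practical constraints shift to sample preparation and to moving, storing, and analyzing the resulting data, which is precisely the big data problem described here.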
Keeping pace with these life sciences technology advances are information technology advances, in which now more 'classic' information-savvy companies such as Microsoft, Amazon, Google, Facebook, eBay, and Yahoo, as well as a new breed of emerging big data mining companies such as Recorded Future, Factual, Locu, and Palantir, have led the way in becoming masters of petabyte- and exabyte-scale datasets, linking pieces of data distributed over massively parallel architectures in response to user requests and presenting them to the user in a matter of seconds (a toy version of this pattern is sketched below). Following these advances in other disciplines, we are on track to access the same types of tools to tackle the big data problems now being faced by the life and biomedical sciences. But large-scale data generation and big computing infrastructures are just two legs of the three-legged stool needed to revolutionize our understanding of living systems. While the data revolution is driven by technologies that provide insights into how living systems operate, understanding living systems will require that we master the information these high-throughput technologies are generating.
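How such systems answer queries over distributed data in seconds is, conceptually, a scatter-gather pattern: the query is sent to many workers, each scans only its local shard of the data, and the partial results are merged. Below is a minimal single-machine sketch of that pattern using Python's standard library; the shard contents and the gene-locus lookup task are hypothetical stand-ins for a real distributed store:

```python
from concurrent.futures import ProcessPoolExecutor

# Hypothetical shards: in a real system each would live on a separate node.
SHARDS = [
    {"BRCA1": "17q21.31", "TP53": "17p13.1"},
    {"APOE": "19q13.32", "CFTR": "7q31.2"},
    {"MYC": "8q24.21", "EGFR": "7p11.2"},
]

def search_shard(args):
    """Scan one shard for the query; each call runs on its own worker."""
    shard, query = args
    return {gene: locus for gene, locus in shard.items() if gene == query}

def scatter_gather(query):
    """Scatter the query to all shards in parallel, then merge the hits."""
    with ProcessPoolExecutor() as pool:
        partial_results = pool.map(search_shard, [(s, query) for s in SHARDS])
    merged = {}
    for part in partial_results:
        merged.update(part)
    return merged

if __name__ == "__main__":
    print(scatter_gather("APOE"))  # {'APOE': '19q13.32'}
```

Real systems add indexing, replication, and fault tolerance, but the response-time win comes from the same idea: no single machine ever scans more than its own shard.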
If we want to achieve understanding from big data, to organize it, compute on it, and build predictive models from it, then we must employ statistical reasoning that goes beyond the more classic hypothesis testing of yesteryear; the simulation below illustrates one reason why. We have moved well beyond the idea that we can simply repeat experiments to validate findings generated in populations.
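One concrete reason the classic single-test mindset fails at this scale is multiple testing: screening hundreds of thousands of SNPs at a conventional threshold of p < 0.05 will, by chance alone, flag tens of thousands of false positives. The following is a minimal sketch, with hypothetical round numbers and pure-null simulated data, not a recipe for real association analysis:

```python
import random

N_TESTS = 500_000   # e.g., SNPs scored per sample (round number from the text)
ALPHA = 0.05        # the classic single-test significance threshold

# Under the null hypothesis, p-values are uniform on [0, 1]; simulate a
# study in which NO SNP is truly associated with the trait.
random.seed(42)
p_values = [random.random() for _ in range(N_TESTS)]

naive_hits = sum(p < ALPHA for p in p_values)                 # ~25,000 false positives
bonferroni_hits = sum(p < ALPHA / N_TESTS for p in p_values)  # ~0, as it should be

print(f"'Significant' at p < 0.05:    {naive_hits}")
print(f"Significant after Bonferroni: {bonferroni_hits}")
```

Even this simple correction shows why analysis at genome scale has to reason about error rates across the whole experiment rather than test by test.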
In fact, while early depictions of the Central Dogma of Biology looked something like the simple graph shown in Figure 26.2a, today, given the complex interplay of