so high, this increases the chances that overfitting to the
observed data will become a significant issue, leading
to results that appear promising but that in reality will
not hold up for clinical use. For the quantified self,
transcriptomics will be only one of the data types, and
thus the number of measurements means that statistical
approaches to control for overfitting will be essential
[36,51]. Fortunately, the continued dramatic reduction
in cost for many 'omics' technologies will help to
mitigate some of these challenges by making it possible
to analyze larger sample numbers, but for what is
contemplated in the near future for P4 medicine we are
still very much in the small sample regime, and will be
for quite some time to come.
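To make the overfitting concern concrete, the following sketch (synthetic data and arbitrary parameter choices; an illustration of the general point rather than a reconstruction of any cited analysis) shows why regularization and cross-validation are essential in the "small n, large p" regime: a weakly constrained classifier fit to thousands of transcript-level features on a few dozen samples can separate its training data almost perfectly even when the labels are pure noise, and only held-out evaluation reveals that nothing real has been learned.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic "small n, large p" setting: 40 samples, 5,000 transcript-like
# features, and labels that carry no real signal at all.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 5000))
y = rng.integers(0, 2, size=40)

# A nearly unregularized fit "explains" the training data essentially perfectly...
weak = LogisticRegression(C=1e6, max_iter=10000)
print("training accuracy:", weak.fit(X, y).score(X, y))               # ~1.0

# ...while cross-validation exposes that it generalizes no better than chance.
print("cross-validated accuracy:",
      cross_val_score(LogisticRegression(C=1e6, max_iter=10000), X, y, cv=5).mean())

# Strong L1 regularization is one standard way to control overfitting here;
# it correctly reports chance-level performance because there is no signal.
sparse = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
print("regularized CV accuracy:", cross_val_score(sparse, X, y, cv=5).mean())
```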
This high dimensionality of data exacerbates issues of reproducibility: results can be difficult to reproduce in detail from study to study, or even from batch to batch [52]. For example, in order to statistically
expect overlap of 50% in identified differentially
expressed genes between two different studies comparing
breast cancer to normal tissue, one would need of the
order of thousands of samples [53]. Almost all individual
studies today have fewer than that, and so one would not expect differentially expressed gene lists between studies to be very similar, even if everything is done correctly and with the highest possible experimental rigor by both laboratories doing the studies. The same is true for identified molecular signatures, where a number of molecular measurements are coupled with a computational algorithm to differentiate phenotypes (e.g., make a disease diagnosis) [54].
Importantly, when one study is used to train a molecular signature to differentiate between phenotypes (e.g., cancer vs. control) and that signature is then tested on a separate study of the same phenotypes, very often the classifier will fail, or at best its performance degrades severely. A primary reason for this drop in performance is heterogeneity between studies, due both to underlying variance in the biology of the patients studied and to technical variations in precisely how the data were measured, normalized, and analyzed. Whenever two individual studies are compared, it is very often the case that the differences that we will refer to here as laboratory effects are greater than the differences due to phenotype (e.g., cancer vs. normal). One powerful means for making signatures much more robust is to integrate their identification across multiple studies at multiple sites [55]. In such integrated studies, the signal associated with the phenotype difference is amplified, while the laboratory effects are damped out. Signatures learned across multiple different studies from multiple different laboratories perform much better on average on yet additional studies than do signatures learned from one study alone. This fact argues strongly for the need to (1) build consortiums that enable the integration of large amounts of data from multiple sites [14] and (2) make data publicly available so that they can be aggregated in meta-analyses. Such data integration is essential to enable P4 medicine.
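One way to see the benefit of multi-study integration, sketched below under purely hypothetical assumptions (the data layout, the simulated "laboratory" offsets, and the helper function name are all invented for illustration), is leave-one-study-out validation: a signature is trained on all studies but one, pooled together, and then evaluated on the held-out study, so that laboratory effects cannot leak into the performance estimate.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

def leave_one_study_out_accuracy(X, y, study_ids):
    """Train on all studies but one (pooled), test on the held-out study.

    X         : (n_samples, n_features) expression matrix (hypothetical)
    y         : phenotype labels, e.g. 1 = cancer, 0 = normal
    study_ids : which study/laboratory each sample came from
    """
    model = make_pipeline(
        StandardScaler(),                          # crude per-feature normalization
        LogisticRegression(penalty="l1", solver="liblinear", C=0.5),
    )
    return cross_val_score(model, X, y, groups=study_ids, cv=LeaveOneGroupOut())

# Toy data: three simulated "studies", each with its own feature-wise offset
# standing in for laboratory effects, plus a weak disease signal shared by all.
rng = np.random.default_rng(1)
blocks, labels, groups = [], [], []
for study in range(3):
    offset = rng.normal(scale=2.0, size=200)       # study-specific shift
    y_s = rng.integers(0, 2, size=60)
    X_s = rng.normal(size=(60, 200)) + offset
    X_s[:, :10] += y_s[:, None]                    # shared disease signal
    blocks.append(X_s)
    labels.append(y_s)
    groups.append(np.full(60, study))

X, y, g = np.vstack(blocks), np.concatenate(labels), np.concatenate(groups)
print(leave_one_study_out_accuracy(X, y, g))       # one accuracy per held-out study
```

Under this protocol, a signature learned from a single study can be compared directly with one learned from the pooled studies, which is the comparison on which the argument for multi-site integration rests.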
Computational challenges of blood as a window.
The computational challenges associated with the maintenance of wellness and the pre-symptomatic diagnosis and prevention of disease are particularly acute. For example, the envisioned blood diag-
nostics of the future must be able to distinguish not just
one disease from normal but rather must differentiate
any possible disease against the background of normal, which can be affected by many conditions, including even mundane changes such as diet, exercise, time of
day, sleep cycle and so forth. As is well appreciated in
machine learning, accuracies tend to degrade quickly
as more potential phenotypes need to be separated
simultaneously. Deciphering signal from noise against
such a dynamic and multifaceted background is
a daunting challenge indeed. A number of strategies
will therefore be important to harness the information
content of the blood as a window to health and disease.
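Before turning to those strategies, the point that accuracy degrades as more phenotypes must be separated simultaneously can be illustrated with a toy simulation (synthetic data generated with scikit-learn; the parameters are arbitrary and illustrative only): the same kind of classifier is fit to problems with an increasing number of classes while the amount of data and signal are held fixed.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Cross-validated accuracy as the number of phenotypes (classes) grows,
# with sample size and informative features held fixed. Illustrative only.
for n_classes in (2, 4, 8, 16):
    X, y = make_classification(
        n_samples=400, n_features=50, n_informative=20,
        n_classes=n_classes, n_clusters_per_class=1, random_state=0,
    )
    acc = cross_val_score(LogisticRegression(max_iter=2000), X, y, cv=5).mean()
    print(f"{n_classes:2d} classes: CV accuracy ~ {acc:.2f}")
```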
1) It is unlikely that any one data platform will be
sufficient to achieve the accuracies that will be needed
for clinical practice across the wide range of possible
disease states. Therefore, achieving the predictive and preventive aspects of P4 medicine will require multi-faceted data analysis, including multiple sources of
molecular data from the blood. As is described here,
the blood contains enormous numbers of different
molecular information sources, including not just the
proteins, but also metabolites, miRNAs, mRNAs,
circulating cells, antibodies and so forth. It will also be
important to link these molecular data with clinical
data as well as input from activated and digitally net-
worked patients (such as changes in lifestyle and environmental exposures). Patient activation refers to
a person's willingness and ability to manage their
health and healthcare, as measured by the Patient
Activation Measure [56]. 2) Another key to addressing this challenge is to build coarse-to-fine hierarchies,
where coarse overall assessments are made initially
and then followed by tests of increasingly finer levels
of specificity, for disease diagnosis and wellness
monitoring. For example, organ-specific blood
proteins can be used to first answer the question of
what organ system is being perturbed. Following this
assessment, more specific molecular markers of finer
resolution that differentiate different diseases of the implicated organ can then be applied.
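A minimal sketch of such a coarse-to-fine scheme, under stated assumptions (the organ labels, disease labels, and feature panels below are invented placeholders, and in practice each stage would draw on several classes of blood analytes rather than proteins alone), is a two-stage classifier: a first model assigns the sample to an organ system, and only then does an organ-specific model discriminate among diseases of that organ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class CoarseToFineDiagnosis:
    """Two-stage sketch: first predict the perturbed organ system from
    organ-enriched blood-protein features, then hand the sample to an
    organ-specific model that discriminates among diseases of that organ.
    All names and panels here are hypothetical placeholders."""

    def __init__(self):
        self.organ_model = LogisticRegression(max_iter=2000)
        self.disease_models = {}                   # organ -> fitted classifier

    def fit(self, X, organ_labels, disease_labels):
        self.organ_model.fit(X, organ_labels)      # coarse stage
        for organ in np.unique(organ_labels):
            mask = organ_labels == organ
            self.disease_models[organ] = LogisticRegression(max_iter=2000).fit(
                X[mask], disease_labels[mask])     # fine stage, per organ
        return self

    def predict(self, X):
        organs = self.organ_model.predict(X)
        diseases = np.array([
            self.disease_models[o].predict(x[None, :])[0]
            for x, o in zip(X, organs)
        ])
        return organs, diseases

# Toy usage with simulated data: 3 organ systems, 2 diseases per organ.
rng = np.random.default_rng(2)
organ = rng.integers(0, 3, size=300)
disease = rng.integers(0, 2, size=300)
X = rng.normal(size=(300, 40))
X[:, :5] += organ[:, None]                          # coarse organ-level signal
X[:, 5:10] += (organ * 2 + disease)[:, None] * 0.5  # finer disease-level signal

model = CoarseToFineDiagnosis().fit(X, organ, disease)
print(model.predict(X[:5]))
```

The appeal of this structure is that each organ-specific model faces a much smaller set of phenotypes to separate, which is exactly the regime in which classification accuracy holds up best.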