Biology Reference
In-Depth Information
be directed, indicating a cause-and-effect relationship, or
undirected, indicating an association or interaction. For
example, a DNA node in the network representing a given
locus that varies in a population of interest may be
connected to a transcript abundance trait, indicating that
changes at the particular DNA locus induce changes in the
levels of the transcript. The potentially millions of such
relationships represented in a network define the overall
connectivity structure of the network, otherwise known as
the topology of the network. Any realistic network topology
will necessarily be complicated and nonlinear from the
standpoint of the more classic biochemical pathway diagrams
represented in textbooks and pathway databases such as
KEGG [37]. The more classic pathway view represents
molecular processes on an individual level, whereas networks
represent global (population-level) metrics that describe
variation between individuals in a population of interest
that in turn defines coherent biological processes in the
tissue or cells associated with the network.
a quantitative trait such as the transcript abundance of
a given gene or levels of a given metabolite. The
conditional probabilities reflect not only relationships
between genes, but also the stochastic nature of these
relationships, as well as noise in the data used to
reconstruct the network.
Bayes' formula allows us to determine the posterior
probability of a network model M given observed data D as
a function of our prior belief that the model is correct and
the probability of the observed data given the model:
$P(M \mid D) \propto P(D \mid M)\,P(M)$.
The number of possible network
structures grows super-exponentially with the number of
nodes, so an exhaustive search of all possible structures to
find the one best supported by the data is not feasible, even
for a relatively small number of nodes. A number of
algorithms exist to find the optimal network without
searching exhaustively, e.g., Markov chain Monte Carlo
(MCMC) [41] simulation. With the MCMC algorithm,
optimal networks are constructed from a set of starting
conditions. This algorithm is run thousands of times to
identify different plausible networks, each time beginning
with different starting conditions. These most plausible
networks can then be combined to obtain a consensus
network. For each of the reconstructions using the MCMC
algorithm, the starting point is a null network. Small
random changes are made to the network by flipping,
adding, or deleting individual edges, ultimately accepting
those changes that lead to an overall improvement in the fit
of the network to the data. To assess whether a change
improves the network model or not, information measures
such as the Bayesian information criterion (BIC) [42] are
employed, which reduce overfitting by imposing a cost on
the addition of new parameters. This is equivalent to
imposing a lower prior probability $P(M)$ on models with
larger numbers of parameters.
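The search procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the cited authors' implementation: nodes are assumed to be binary, the data are simulated from a hypothetical three-node chain, and every name (`count_dags`, `is_acyclic`, `bic_score`, `search`, `consensus_edges`, `simulate_chain`) is invented for the example. The acceptance rule is the simple "keep the change if the score improves" rule from the text rather than a full Metropolis-Hastings step, and `count_dags` uses Robinson's recurrence for labeled directed acyclic graphs to show how quickly the search space grows.

```python
import math
import random

def count_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    a = [1]
    for m in range(1, n + 1):
        a.append(sum((-1) ** (k + 1) * math.comb(m, k) * 2 ** (k * (m - k)) * a[m - k]
                     for k in range(1, m + 1)))
    return a[n]

def is_acyclic(parents):
    """Kahn-style check: repeatedly strip nodes with no remaining parents."""
    remaining = {node: set(pa) for node, pa in parents.items()}
    while remaining:
        roots = [node for node, pa in remaining.items() if not pa]
        if not roots:
            return False  # every remaining node has a parent, so a cycle exists
        for node in roots:
            del remaining[node]
        for pa in remaining.values():
            pa.difference_update(roots)
    return True

def bic_score(data, parents):
    """BIC of a binary network: log-likelihood minus (k / 2) * log(N)."""
    score = 0.0
    for node, pa in parents.items():
        pa = sorted(pa)
        counts = {}  # parent configuration -> [count of node == 0, count of node == 1]
        for row in data:
            counts.setdefault(tuple(row[p] for p in pa), [0, 0])[row[node]] += 1
        for pair in counts.values():
            total = sum(pair)
            score += sum(c * math.log(c / total) for c in pair if c)
        # One free parameter per parent configuration of a binary node.
        score -= 0.5 * math.log(len(data)) * 2 ** len(pa)
    return score

def search(data, n_nodes, steps, rng):
    """One reconstruction: start from the null network, keep improving edits."""
    parents = {i: set() for i in range(n_nodes)}
    best = bic_score(data, parents)
    for _ in range(steps):
        a, b = rng.sample(range(n_nodes), 2)
        cand = {node: set(pa) for node, pa in parents.items()}
        if a in cand[b]:          # edge a -> b exists: delete it or flip it
            cand[b].remove(a)
            if rng.random() < 0.5:
                cand[a].add(b)
        else:                     # otherwise try adding a -> b
            cand[b].add(a)
        if not is_acyclic(cand):
            continue
        score = bic_score(data, cand)
        if score > best:
            parents, best = cand, score
    return parents, best

def consensus_edges(data, n_nodes, runs, steps, seed=0):
    """Fraction of independent runs in which each undirected edge appears."""
    freq = {}
    for r in range(runs):
        parents, _ = search(data, n_nodes, steps, random.Random(seed + r))
        for child, pa in parents.items():
            for p in pa:
                edge = tuple(sorted((p, child)))
                freq[edge] = freq.get(edge, 0) + 1
    return {edge: c / runs for edge, c in freq.items()}

def simulate_chain(n_rows, rng, flip=0.1):
    """Hypothetical toy data: X0 -> X1 -> X2, each child a noisy copy of its parent."""
    rows = []
    for _ in range(n_rows):
        x0 = rng.randint(0, 1)
        x1 = x0 ^ (rng.random() < flip)
        x2 = x1 ^ (rng.random() < flip)
        rows.append((x0, x1, x2))
    return rows
```

Calling `consensus_edges(simulate_chain(500, random.Random(1)), 3, runs=5, steps=200)` returns, for each pair of nodes, the fraction of runs in which an edge between them was kept. Here the runs differ only in their random seed, which stands in for the "different starting conditions" mentioned above. The BIC penalty term in `bic_score` is what discourages the search from adding edges that the data do not support.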
Even though edges in Bayesian networks are directed,
we cannot in general infer causal relationships from the
structure directly, just as discussed above in relation to the
causal inference test. For a network with three nodes, $X_1$,
$X_2$, and $X_3$, there are multiple groups of structures that are
mathematically equivalent. For example, the three models
M1
An Integrative Genomics Approach to
Constructing Predictive Network Models
Systematically integrating different types of data into
probabilistic networks using Bayesian networks has been
proposed and applied for the purpose of predicting
protein–protein interactions [38] and protein function [39].
However, these Bayesian networks are still based on
association between nodes in the network, as opposed to
causal relationships. As discussed above for the simple case
of two traits, from these types of networks we cannot infer
whether a specific perturbation will affect a complex
disease trait. To make such predictions we need networks
capable of representing causal relationships. Probabilistic
causal networks are one way to model such relationships
from the top down, where causality again in this context
reflects a probabilistic belief that one node in the network
affects the behavior of another. Bayesian networks [40] are
one type of probabilistic causal network that provides
a natural framework for integrating highly dissimilar types
of data.
Bayesian networks are directed acyclic graphs in which
the edges of the graph are defined by conditional
probabilities that characterize the distribution of states
of each node given the state of its parents [40]. The network
topology defines a partitioned joint probability distribution
over all nodes in a network, such that the probability
distribution of states of a node depends only on the states of
its parent nodes: formally, a joint probability distribution
$p(X)$ on a set of nodes $X$ can be decomposed as
$p(X) = \prod_i p(X_i \mid \mathrm{Pa}(X_i))$, where
$\mathrm{Pa}(X_i)$ represents the parent set of $X_i$. The
biological networks of interest we wish to construct are
comprised of nodes that represent
M1: $X_1 \to X_2$, $X_2 \to X_3$;
M2: $X_2 \to X_1$, $X_2 \to X_3$; and
M3: $X_2 \to X_1$, $X_3 \to X_2$, are all Markov equivalent,
meaning that they all encode the same conditional independence
relationship: $X_1 \perp X_3 \mid X_2$, that is, $X_1$ and $X_3$
are independent conditional on $X_2$. In addition, these models
are mathematically equivalent:
$$
\begin{aligned}
p(X) &= p(\mathrm{M1} \mid D) = p(X_2 \mid X_1)\,p(X_1)\,p(X_3 \mid X_2) \\
     &= p(\mathrm{M2} \mid D) = p(X_1 \mid X_2)\,p(X_2)\,p(X_3 \mid X_2) \\
     &= p(\mathrm{M3} \mid D) = p(X_2 \mid X_3)\,p(X_3)\,p(X_1 \mid X_2)
\end{aligned}
$$
Thus, from correlation data alone we cannot infer
whether $X_1$ is causal for $X_2$ or vice versa from these
types of structures.
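The equivalence of the three factorizations can be checked numerically. The sketch below is an illustration only: the probability tables are hypothetical values chosen for the example, and the variables are assumed to be binary. It builds the joint distribution implied by M1, derives the marginals and conditionals that M2 and M3 need from that same joint, and confirms that all three products give back the same joint probabilities.

```python
from itertools import product

# Hypothetical parameters for M1 (X1 -> X2 -> X3); all variables are binary.
p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # indexed [x1][x2]
p_x3_given_x2 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # indexed [x2][x3]

# Joint distribution implied by M1.
joint = {
    (x1, x2, x3): p_x1[x1] * p_x2_given_x1[x1][x2] * p_x3_given_x2[x2][x3]
    for x1, x2, x3 in product((0, 1), repeat=3)
}

def marginal(keep):
    """Sum the joint over every variable whose position is not in `keep`."""
    out = {}
    for states, p in joint.items():
        key = tuple(states[i] for i in keep)
        out[key] = out.get(key, 0.0) + p
    return out

p1, p2, p3 = marginal([0]), marginal([1]), marginal([2])
p12, p23 = marginal([0, 1]), marginal([1, 2])

def m1(x1, x2, x3):
    # p(X2 | X1) p(X1) p(X3 | X2)
    return (p12[(x1, x2)] / p1[(x1,)]) * p1[(x1,)] * (p23[(x2, x3)] / p2[(x2,)])

def m2(x1, x2, x3):
    # p(X1 | X2) p(X2) p(X3 | X2)
    return (p12[(x1, x2)] / p2[(x2,)]) * p2[(x2,)] * (p23[(x2, x3)] / p2[(x2,)])

def m3(x1, x2, x3):
    # p(X2 | X3) p(X3) p(X1 | X2)
    return (p23[(x2, x3)] / p3[(x3,)]) * p3[(x3,)] * (p12[(x1, x2)] / p2[(x2,)])

# Largest disagreement between any factorization and the joint itself.
max_diff = max(abs(f(*s) - joint[s]) for f in (m1, m2, m3) for s in joint)
print(max_diff < 1e-12)
```

Because the three products are indistinguishable on any data generated this way, no score computed from the joint distribution alone can prefer one edge direction over another within this group of structures.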
It is worth noting, however, that there is