Causal Inference and the Construction of Predictive Network Models in Biology - Systems Biology Concepts and Insights


be directed, indicating a cause-and-effect relationship, or undirected, indicating an association or interaction. For example, a DNA node in the network representing a given locus that varies in a population of interest may be connected to a transcript abundance trait, indicating that changes at the particular DNA locus induce changes in the levels of the transcript. The potentially millions of such relationships represented in a network define the overall connectivity structure of the network, or what is otherwise known as the topology of the network. Any realistic network topology will be necessarily complicated and nonlinear from the standpoint of the more classic biochemical pathway diagrams represented in textbooks and pathway databases such as KEGG [37]. The more classic pathway view represents molecular processes on an individual level, whereas networks represent global (population level) metrics that describe variation between individuals in a population of interest that in turn defines coherent biological processes in the tissue or cells associated with the network.

a quantitative trait such as the transcript abundance of a given gene or levels of a given metabolite. The conditional probabilities reflect not only relationships between genes, but also the stochastic nature of these relationships, as well as noise in the data used to reconstruct the network. Bayes' formula allows us to determine the posterior probability of a network model $M$ given observed data $D$ as a function of our prior belief that the model is correct and the probability of the observed data given the model: $P(M \mid D) \propto P(D \mid M)\,P(M)$. The number of possible network structures grows super-exponentially with the number of nodes, so an exhaustive search of all possible structures to find the one best supported by the data is not feasible, even for a relatively small number of nodes. A number of algorithms exist to search for the optimal network without enumerating every structure, e.g., Markov chain Monte Carlo (MCMC) [41] simulation. With the MCMC algorithm, candidate networks are constructed from a set of starting conditions. This algorithm is run thousands of times to identify different plausible networks, each time beginning with different starting conditions. These most plausible networks can then be combined to obtain a consensus network. For each of the reconstructions using the MCMC algorithm, the starting point is a null network. Small random changes are made to the network by flipping, adding, or deleting individual edges, ultimately accepting those changes that lead to an overall improvement in the fit of the network to the data. To assess whether a change improves the network model or not, information measures such as the Bayesian information criterion (BIC) [42] are employed, which reduce overfitting by imposing a cost on the addition of new parameters. This is equivalent to imposing a lower prior probability $P(M)$ on models with larger numbers of parameters.
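The edge-by-edge search described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure used in the cited work: the nodes are binary, the data are simulated from a hypothetical three-node chain, a change is accepted only when it strictly improves a BIC-style score (a greedy simplification of a full MCMC acceptance rule), and every function name (`bic_score`, `propose`, `search`, `consensus`, `simulate_chain`) is invented for this example.

```python
import math
import random


def is_acyclic(parents):
    """Depth-first check that the parent map contains no directed cycle."""
    state = {}  # 1 = on the current DFS path, 2 = fully explored

    def visit(node):
        if state.get(node) == 2:
            return True
        if state.get(node) == 1:
            return False  # reached a node already on the path: cycle
        state[node] = 1
        ok = all(visit(p) for p in parents[node])
        state[node] = 2
        return ok

    return all(visit(node) for node in parents)


def bic_score(data, parents):
    """Maximised log-likelihood of binary data minus (k / 2) * ln(n); higher is better."""
    n = len(data)
    score = 0.0
    for child, pa in parents.items():
        pa = sorted(pa)
        counts = {}  # parent configuration -> [count of child == 0, count of child == 1]
        for row in data:
            cell = counts.setdefault(tuple(row[p] for p in pa), [0, 0])
            cell[row[child]] += 1
        for cell in counts.values():
            total = sum(cell)
            score += sum(c * math.log(c / total) for c in cell if c)
        # Penalty: one free parameter per parent configuration of this node.
        score -= 0.5 * math.log(n) * 2 ** len(pa)
    return score


def propose(parents, rng):
    """Copy the network and flip, add, or delete one randomly chosen edge."""
    a, b = rng.sample(sorted(parents), 2)
    new = {node: set(pa) for node, pa in parents.items()}
    if a in new[b]:            # a -> b exists
        new[b].remove(a)       # delete it ...
        if rng.random() < 0.5:
            new[a].add(b)      # ... or flip it to b -> a
    elif b in new[a]:          # b -> a exists: flip it to a -> b
        new[a].remove(b)
        new[b].add(a)
    else:
        new[b].add(a)          # no edge yet: add a -> b
    return new


def search(data, n_nodes, steps, seed):
    """Stochastic local search starting from the null (edgeless) network."""
    rng = random.Random(seed)
    current = {node: set() for node in range(n_nodes)}
    best = bic_score(data, current)
    for _ in range(steps):
        candidate = propose(current, rng)
        if not is_acyclic(candidate):
            continue
        score = bic_score(data, candidate)
        if score > best:  # assumption: keep only strict improvements
            current, best = candidate, score
    return current, best


def consensus(networks, threshold=0.5):
    """Keep undirected edges present in more than `threshold` of the networks."""
    counts = {}
    for parents in networks:
        for child, pa in parents.items():
            for p in pa:
                edge = frozenset((p, child))
                counts[edge] = counts.get(edge, 0) + 1
    return {edge for edge, c in counts.items() if c / len(networks) > threshold}


def simulate_chain(n, seed):
    """Hypothetical binary data from the chain 0 -> 1 -> 2 with 10% noise per link."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x0 = rng.randint(0, 1)
        x1 = x0 if rng.random() < 0.9 else 1 - x0
        x2 = x1 if rng.random() < 0.9 else 1 - x1
        rows.append((x0, x1, x2))
    return rows


data = simulate_chain(300, seed=0)
runs = [search(data, 3, steps=200, seed=s)[0] for s in range(10)]
print(sorted(tuple(sorted(edge)) for edge in consensus(runs)))
```

The consensus here is taken over undirected edges because, for this score, structures that differ only in the orientation of some edges can fit the data equally well, so individual runs may orient the same edge differently. The penalty term in `bic_score` plays the role of the lower prior probability placed on models with more parameters.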

Even though edges in Bayesian networks are directed, we cannot in general infer causal relationships from the structure directly, just as discussed above in relation to the causal inference test. For a network with three nodes, $X_1$, $X_2$, and $X_3$, there are multiple groups of structures that are mathematically equivalent. For example, the three models

An Integrative Genomics Approach to Constructing Predictive Network Models

Systematically integrating different types of data into probabilistic networks using Bayesian networks has been proposed and applied for the purpose of predicting protein-protein interactions [38] and protein function [39]. However, these Bayesian networks are still based on association between nodes in the network, as opposed to causal relationships. As discussed above for the simple case of two traits, from these types of networks we cannot infer whether a specific perturbation will affect a complex disease trait. To make such predictions we need networks capable of representing causal relationships. Probabilistic causal networks are one way to model such relationships from the top down, where causality again in this context reflects a probabilistic belief that one node in the network affects the behavior of another. Bayesian networks [40] are one type of probabilistic causal network that provides a natural framework for integrating highly dissimilar types of data.

Bayesian networks are directed acyclic graphs in which the edges of the graph are defined by conditional probabilities that characterize the distribution of states of each node given the state of its parents [40]. The network topology defines a partitioned joint probability distribution over all nodes in a network, such that the probability distribution of states of a node depends only on the states of its parent nodes: formally, a joint probability distribution $p(X)$

$X_1 \to X_2,\ X_2 \to X_3$; $X_2 \to X_1,\ X_2 \to X_3$; and $X_2 \to X_1,\ X_3 \to X_2$, are all Markov equivalent, meaning that they all encode for the same conditional independence relationship: $X_1 \perp X_3 \mid X_2$ ($X_1$ and $X_3$ are independent conditional on $X_2$). In addition, these models are mathematically equivalent:

$$
\begin{aligned}
p(X) &= p(M_1 \mid D) = p(X_2 \mid X_1)\,p(X_1)\,p(X_3 \mid X_2) \\
     &= p(M_2 \mid D) = p(X_1 \mid X_2)\,p(X_2)\,p(X_3 \mid X_2) \\
     &= p(M_3 \mid D) = p(X_2 \mid X_3)\,p(X_3)\,p(X_1 \mid X_2)
\end{aligned}
$$
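The equivalence of the three factorizations can be checked numerically. The sketch below assumes binary nodes and uses made-up conditional probability tables for the first model; it builds the joint distribution from that model, derives the conditionals needed by the other two models from the same joint, and compares the three products state by state. The variable and function names are illustrative only.

```python
from itertools import product

STATES = (0, 1)

# Hypothetical conditional probability tables for model M1: X1 -> X2 -> X3.
p_x1 = {0: 0.7, 1: 0.3}
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1}, 1: {0: 0.2, 1: 0.8}}  # indexed [x1][x2]
p_x3_given_x2 = {0: {0: 0.6, 1: 0.4}, 1: {0: 0.1, 1: 0.9}}  # indexed [x2][x3]

# Joint distribution under M1: p(X2|X1) p(X1) p(X3|X2).
joint_m1 = {
    (x1, x2, x3): p_x2_given_x1[x1][x2] * p_x1[x1] * p_x3_given_x2[x2][x3]
    for x1, x2, x3 in product(STATES, repeat=3)
}


def marginal(joint, axes):
    """Sum the joint over every variable not listed in `axes` (0-based positions)."""
    out = {}
    for states, prob in joint.items():
        key = tuple(states[a] for a in axes)
        out[key] = out.get(key, 0.0) + prob
    return out


p2 = marginal(joint_m1, (1,))
p3 = marginal(joint_m1, (2,))
p12 = marginal(joint_m1, (0, 1))
p23 = marginal(joint_m1, (1, 2))


def m2(x1, x2, x3):
    """M2 (X2 -> X1, X2 -> X3): p(X1|X2) p(X2) p(X3|X2)."""
    return (p12[(x1, x2)] / p2[(x2,)]) * p2[(x2,)] * (p23[(x2, x3)] / p2[(x2,)])


def m3(x1, x2, x3):
    """M3 (X3 -> X2, X2 -> X1): p(X2|X3) p(X3) p(X1|X2)."""
    return (p23[(x2, x3)] / p3[(x3,)]) * p3[(x3,)] * (p12[(x1, x2)] / p2[(x2,)])


# Largest absolute disagreement between M1 and the other two factorizations.
max_gap = max(
    max(abs(joint_m1[s] - m2(*s)), abs(joint_m1[s] - m3(*s)))
    for s in joint_m1
)
print(f"largest difference between factorizations: {max_gap:.2e}")
```

Any difference reported is floating-point rounding only: all three products describe the same joint distribution, which is why observational data drawn from it cannot favor one edge orientation over another.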

on a set of nodes $X$ can be decomposed as $p(X) = \prod_i p(X_i \mid \mathrm{Pa}(X_i))$, where $\mathrm{Pa}(X_i)$ represents the parent set of $X_i$. The biological networks of interest we wish to construct are comprised of nodes that represent

Thus, from correlation data alone we cannot infer whether $X_1$ is causal for $X_2$ or vice versa from these types of structures. It is worth noting, however, that there is
