Forests of Latent Tree Models to Decipher Genotype-Phenotype Associations - Biomedical Engineering Systems and Technologies


$P(X_i, X_j) = P(X_i)\,P(X_j)$ and the dependent model $P(X_i \mid X_j)\,P(X_j)$:

$$\bigl(H(X_i) + H(X_j)\bigr) - \bigl(H(X_i \mid X_j) + H(X_j)\bigr) = H(X_i) - H(X_i \mid X_j).$$

The larger the difference between entropies, the higher is the dependence.
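As an illustration, the entropy difference $H(X_i) - H(X_i \mid X_j)$ can be estimated from observed data with empirical frequencies. The following is a minimal sketch, not the authors' implementation: the function names and the toy genotype vectors are hypothetical, and entropies are computed in bits using the identity $H(X_i \mid X_j) = H(X_i, X_j) - H(X_j)$.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Empirical Shannon entropy, in bits, of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(xi, xj):
    """Empirical H(X_i | X_j), computed as H(X_i, X_j) - H(X_j)."""
    return entropy(list(zip(xi, xj))) - entropy(xj)


def entropy_difference(xi, xj):
    """H(X_i) - H(X_i | X_j): larger values indicate stronger dependence."""
    return entropy(xi) - conditional_entropy(xi, xj)


# Hypothetical toy data: one value per individual for each variable.
x_a = [0, 0, 1, 1]
x_b = [0, 1, 0, 1]  # empirically independent of x_a
x_c = [0, 0, 1, 1]  # identical to x_a, hence fully dependent

independent_score = entropy_difference(x_a, x_b)
dependent_score = entropy_difference(x_a, x_c)
```

In this toy example the score is zero for the independent pair and equals $H(X_i)$ for the fully dependent pair, which are the two extremes of the measure.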

Definition 5 (Allele, SNPs, Haplotype). Due to the presence of pairs of chromosomes in the human genome, the DNA at a given chromosome locus (SNP) may be described either through a pair of variants (alleles, or phased data) at the finer description level, or through a unique variant (unphased data). As SNPs are biallelic, only two alleles are encountered at the corresponding loci (instead of the 4 possible nucleotides A, T, C, G). Thus, SNPs are discrete variables whose three possible values may be coded as, say, 0, 1 and 2, to account respectively for aa, {Aa, aA} (usually not distinguishable) and AA, where A and a are the two alleles. A haplotype is defined as a sequence of alleles.
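The coding in Definition 5 can be sketched in a few lines. This is an illustrative example only: the function names, the string representation of genotypes, and the choice of counting copies of allele A are assumptions made here, not part of the original text.

```python
def encode_genotype(genotype, counted="A", other="a"):
    """Code an unphased biallelic genotype as 0, 1 or 2.

    The code is the number of copies of `counted`, so that
    aa -> 0, Aa or aA -> 1 (not distinguished), AA -> 2.
    """
    if len(genotype) != 2 or any(al not in (counted, other) for al in genotype):
        raise ValueError(f"not a biallelic genotype: {genotype!r}")
    return sum(1 for al in genotype if al == counted)


def split_haplotypes(phased_genotypes):
    """Split phased genotypes (one allele pair per SNP) into two haplotypes.

    Each haplotype is the sequence of alleles carried by one chromosome.
    """
    first = tuple(pair[0] for pair in phased_genotypes)
    second = tuple(pair[1] for pair in phased_genotypes)
    return first, second


# Hypothetical phased data for three SNPs of one individual.
phased = ["Aa", "aa", "AA"]
codes = [encode_genotype(g) for g in phased]
hap1, hap2 = split_haplotypes(phased)
```

Note that the 0/1/2 coding discards phase: "Aa" and "aA" map to the same value, whereas the two haplotypes returned by `split_haplotypes` retain it.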

3 Motivation and Related Work

3.1 Motivation

To tackle the difficult problem of disease association detection, several algorithms from the machine learning domain have been proposed. Some of them use probabilistic graphical models (PGMs) [5,6]. Recently, forests of latent tree models have been investigated for the purpose of modeling linkage disequilibrium (LD) [3]. A forest of latent tree models (FLTM) is a forest whose trees are latent tree models (LTMs) (see Figure 1). FLTMs generalize LTMs, since the variables are not constrained to be dependent upon one another, either directly or indirectly. Thus, FLTMs can describe a larger set of configurations than LTMs.
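The structural difference between an LTM and an FLTM can be made concrete with a small sketch. The representation below (a dictionary mapping each node to its parent) and all node names are hypothetical choices made for illustration; the sketch covers graph structure only and carries no probability distributions.

```python
def tree_root(parent, node):
    """Follow parent links from `node` up to the root of its tree."""
    while parent[node] is not None:
        node = parent[node]
    return node


def forest_roots(parent):
    """Return the roots of the forest, one per tree."""
    return sorted(node for node, p in parent.items() if p is None)


def connected(parent, u, v):
    """True if u and v belong to the same tree of the forest."""
    return tree_root(parent, u) == tree_root(parent, v)


# Hypothetical forest: H* are latent variables, X* are observed variables.
# Tree 1: H3 -> {H1, X5}, H1 -> {X1, X2}.  Tree 2: H2 -> {X3, X4}.
fltm_parent = {
    "H3": None, "H1": "H3", "X5": "H3", "X1": "H1", "X2": "H1",
    "H2": None, "X3": "H2", "X4": "H2",
}

roots = forest_roots(fltm_parent)
```

Here `connected(fltm_parent, "X1", "X5")` holds because both variables share the root `H3`, whereas `X1` and `X3` lie in different trees and are therefore not linked, directly or indirectly. A single LTM corresponds to the special case where the forest has exactly one root.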

Modeling variables as highly correlated as those in genotypic data is a challenge, and meeting it is all the more crucial for downstream analyses such as the study and visualization of linkage disequilibrium, the mapping of disease susceptibility genetic patterns and the study of population structure. Most notably, the benefits of using FLTMs to model LD stem from their ability to account for multiple degrees of SNP dependence and to deal naturally with the fuzzy nature of LD block boundaries. As will be emphasized further, this latter advantage results from the FLTM learning algorithm, which does not require that the SNPs subsumed by the same latent variable be neighbouring SNPs along the genome.

3.2 Probabilistic Graphical Models to Model Linkage Disequilibrium

The FLTM-based model is intended as an improved alternative to other PGM-based works addressing LD modeling. Besides the learning of parameters (θ), that is, a priori and conditional probabilities for Bayesian networks, and probability distributions over cliques and separators for Markov random fields, the most challenging task in PGM learning is structure inference. Thomas and Camp pioneered the use of PGMs to model LD [7]. To this end, their approach relies on the general class of decomposable Markov random fields (DMRFs). Decomposable graphs allow the efficient computation of the likelihood of the structure, given the data. Thus, structure learning is easily performed by navigating the structure space while optimizing a log-likelihood-based score. To explore the DMRF space, operations based on connection or disconnection
