Forests of Latent Tree Models to Decipher Genotype-Phenotype Associations - Biomedical Engineering Systems and Technologies

1 Introduction

Thanks to their ability to capture (conditional) independences and dependences between variables, probabilistic graphical models (PGMs) offer a suitable framework for fine-grained modeling of relationships between variables under uncertainty. A PGM is a probabilistic model that relies on a graph encoding conditional dependences within a set of random variables. It provides a compact and natural representation of the joint distribution of these variables. Bayesian networks (BNs) are a commonly used class of PGMs.

Although the observed variables are often sufficient to describe their joint distribution, additional unobserved variables, also called latent variables, sometimes have a role to play. In this context, hierarchical Bayesian networks such as latent tree models (LTMs), formerly named hierarchical latent class models, were proposed. LTMs are tree-shaped BNs in which leaf nodes are observed while internal nodes are not. LTMs generalize latent class models (LCMs), which contain a single latent variable whose only edges connect it to each of the observed variables. In LTMs, multiple latent variables organized in a hierarchical structure make it possible to depict a large variety of relations, from local to higher-order dependences (see Figure 1). LCMs force the observed variables to be independent conditionally on the latent variable. In contrast, LTMs relax this local independence assumption, which is often violated in observed data.
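
The local independence assumption of an LCM can be made concrete with a small sketch. The parameters below (one binary latent variable H, three binary observed variables, and all probability values) are hypothetical and chosen only for illustration; they do not come from the chapter. Under these assumptions, the joint distribution of the observed variables factorizes as a sum over H of P(h) times the product of the P(x_i | h).

```python
from itertools import product

# Hypothetical toy parameters (illustrative only, not from the chapter):
# one binary latent variable H and three binary observed variables X0..X2.
PRIOR = [0.6, 0.4]                 # PRIOR[h] = P(H = h)
COND = [                           # COND[i][h][x] = P(X_i = x | H = h)
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.7, 0.3], [0.1, 0.9]],
]


def lcm_joint(xs):
    """Joint probability of an observed configuration under the LCM:
    P(x_0..x_n) = sum_h P(h) * prod_i P(x_i | h)."""
    total = 0.0
    for h, p_h in enumerate(PRIOR):
        term = p_h
        for i, x in enumerate(xs):
            term *= COND[i][h][x]
        total += term
    return total


def marginal(fixed):
    """Marginal probability of a partial assignment {variable index: value},
    obtained by summing the joint over all unfixed observed variables."""
    n = len(COND)
    return sum(
        lcm_joint(xs)
        for xs in product([0, 1], repeat=n)
        if all(xs[i] == v for i, v in fixed.items())
    )


# Given H, the observed variables are independent by construction.
# Once H is summed out, X0 and X1 are no longer independent:
p_joint = marginal({0: 0, 1: 0})
p_product = marginal({0: 0}) * marginal({1: 0})
print(p_joint, p_product)
```

In this toy model the two printed values differ, which shows that the single latent variable is what induces the dependence between the observed variables. An LTM would stack several such latent variables in a tree.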

Few algorithms have been developed to learn such models, and still fewer for applications in association genetics [1]. Forests of LTMs have recently been proposed as potentially useful for association studies [2,3]. In biomedical research, association studies rely on the description of DNA variants at characterized genome loci (genetic markers) for all subjects in case and control cohorts. Such studies attempt to identify any putative dependence, or association, between one or possibly several genetic markers and the affected/unaffected status. In the case of a single causal locus, a putative association is revealed if the distribution of variants between cases and controls shows an accumulation of cases for some variant(s). From now on, we will refer to the most popular genetic markers, namely Single Nucleotide Polymorphisms (SNPs).
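
As an illustration of how a single-marker association might be detected, the sketch below applies a standard Pearson chi-square test to a 2x2 table of allele counts in cases and controls. This is a generic textbook test, not the method of the chapter; the function name `allelic_chi2` and the counts are hypothetical, and the sketch assumes every row and column total is non-zero.

```python
import math


def allelic_chi2(case_counts, control_counts):
    """Pearson chi-square statistic and p-value (1 degree of freedom)
    for a 2x2 table of allele counts: rows = cases/controls,
    columns = the two alleles of one SNP.
    Assumes all row and column totals are non-zero."""
    table = [list(case_counts), list(control_counts)]
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of allele and status.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(chi2_1 > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical allele counts (made up for illustration):
# cases carry the first allele more often than controls.
chi2, p_value = allelic_chi2(case_counts=(60, 40), control_counts=(45, 55))
print(chi2, p_value)
```

A small p-value indicates that the allele distribution differs between cases and controls, which is the "accumulation" described above. In a genome-wide setting such a test would be repeated for every SNP and corrected for multiple testing.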

One of the first motivations for proposing this novel model, the forest of LTMs (FLTM), is to account for linkage disequilibrium (LD) as faithfully as possible. Linkage disequilibrium occurs because DNA variants located close together on a chromosome are rarely separated by the shuffling of chromosomes (recombination) that takes place during sex cell formation. Such variants are therefore transmitted together (as a haplotype) from parent to child. Such patterns underlie the so-called haplotype block structure [4]: "blocks" where statistical dependences between loci are high alternate with shorter regions characterized by low statistical dependences, the recombination hotspots. LD is crucial for association studies because a causal locus that does not exactly coincide with a SNP is nevertheless expected to be flanked by SNPs that are highly likely to show an (indirect) association with the phenotype. Besides, exploiting these high correlations is appealing for data dimension reduction.
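
The strength of LD between two loci is commonly quantified by the squared correlation r^2 between their alleles. The sketch below computes it from phased haplotypes. It is a minimal illustration under stated assumptions: both loci are biallelic with alleles coded 0/1, the function name `ld_r2` is hypothetical, and the haplotype lists are made up.

```python
def ld_r2(haplotypes):
    """Squared correlation r^2 between two biallelic loci.

    `haplotypes` is a list of (a, b) pairs, one per chromosome, where a and b
    are the alleles (coded 0 or 1) carried at the first and second locus.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n      # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n      # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                         # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Hypothetical haplotypes: alleles always inherited together (complete LD).
linked = [(0, 0)] * 4 + [(1, 1)] * 4
# Hypothetical haplotypes: all four combinations equally frequent (no LD).
unlinked = [(0, 0), (0, 1), (1, 0), (1, 1)]
print(ld_r2(linked), ld_r2(unlinked))
```

Values of r^2 near 1 correspond to pairs of SNPs inside the same haplotype block, while values near 0 are expected across a recombination hotspot.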

Data dimension reduction exploiting LD is not new to genetics. However, tackling this issue through adapted Bayesian networks has only recently been proposed [3].
