Biomedical Engineering Reference
1 Introduction
Thanks to their ability to capture (conditional) independences and dependences between
variables, probabilistic graphical models (PGMs) offer a well-suited framework for finely
modeling relationships between variables in uncertain data. A PGM
is a probabilistic model relying on a graph that encodes conditional dependences within a
set of random variables. It provides a compact and natural representation of the
joint distribution of this set of variables. Bayesian networks (BNs) are a commonly used
branch of PGMs.
Although the observed variables are often sufficient to describe their joint
distribution, additional unobserved variables, also called latent variables, sometimes
have a role to play. In this context, hierarchical Bayesian networks such as latent tree
models (LTMs), formerly named hierarchical latent class models, were proposed. LTMs
are tree-shaped BNs in which leaf nodes are observed while internal nodes are not. LTMs
generalize latent class models (LCMs), which contain a single latent variable
whose only edges connect it to each of the observed variables. In LTMs,
multiple latent variables organized in a hierarchical structure make it possible to depict
a large variety of relations, from local to higher-order dependences (see Figure 1).
LCMs force the observed variables to be independent conditionally on the latent variable.
In contrast, LTMs relax this local independence assumption, which is often violated
in observed data.
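
The local independence assumption of an LCM can be illustrated with a minimal Python sketch. All numbers below are hypothetical and chosen only for illustration (one binary latent variable H, three binary observed variables); the names `lcm_joint` and `marginal` are not from the source. Under an LCM, the joint distribution of the observed variables is the sum over h of P(h) times the product of the P(x_i | h).

```python
from itertools import product

# Hypothetical parameters (not from the source): one binary latent variable H
# and three binary observed variables X1, X2, X3.
P_H = [0.6, 0.4]  # P(H = h)
# P_X_GIVEN_H[i][h][x] = P(X_i = x | H = h)
P_X_GIVEN_H = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.8, 0.2], [0.3, 0.7]],
    [[0.7, 0.3], [0.1, 0.9]],
]


def lcm_joint(xs):
    """P(X = xs) under the LCM: sum over h of P(h) * prod_i P(x_i | h)."""
    total = 0.0
    for h, p_h in enumerate(P_H):
        term = p_h
        for i, x in enumerate(xs):
            term *= P_X_GIVEN_H[i][h][x]
        total += term
    return total


def marginal(assignment):
    """Probability of a partial assignment {variable index: value}."""
    n = len(P_X_GIVEN_H)
    return sum(
        lcm_joint(xs)
        for xs in product([0, 1], repeat=n)
        if all(xs[i] == v for i, v in assignment.items())
    )


# X1 and X2 are independent given H by construction, yet dependent marginally:
# their joint probability differs from the product of their marginals.
p_both = marginal({0: 1, 1: 1})
p_product = marginal({0: 1}) * marginal({1: 1})
print(p_both, p_product)
```

In this toy model the only source of dependence between observed variables is the shared latent parent. An LTM would add further latent variables above H to capture dependences that a single latent variable cannot explain.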
Few algorithms have been developed to learn such models, and still fewer for
applications in association genetics [1]. Forests of LTMs have recently been proposed as
potentially useful for association studies [2,3]. In biomedical research,
association studies rely on the description of DNA variants at characterized genome loci
(genetic markers) for all subjects in case and control cohorts. Such studies attempt
to identify any putative dependence, or association, between one or several
genetic markers and the affected/unaffected status. In the case of a single causal
locus, a putative association is revealed if the distribution of variants between cases and
controls shows an accumulation of cases with respect to some variant(s). From
now on, we will refer to the most popular genetic markers, that is, Single Nucleotide
Polymorphisms (SNPs).
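
As a rough illustration of how a single-marker association might be assessed, the sketch below computes a Pearson chi-square statistic on a case/control by genotype contingency table. This is a standard textbook test, not necessarily the procedure used in the cited works; the genotype counts are hypothetical, and the function name `chi_square_statistic` is introduced here for illustration only. The sketch assumes every genotype column has a nonzero total.

```python
import math


def chi_square_statistic(table):
    """Pearson chi-square statistic and degrees of freedom for a contingency table."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count under independence of status and genotype.
            expected = row_totals[i] * col_totals[j] / grand_total
            stat += (observed - expected) ** 2 / expected
    dof = (len(table) - 1) * (len(col_totals) - 1)
    return stat, dof


# Hypothetical genotype counts at one SNP (0, 1 or 2 copies of the minor allele).
cases = [30, 50, 20]
controls = [50, 40, 10]

stat, dof = chi_square_statistic([cases, controls])
# With exactly 2 degrees of freedom, the chi-square survival function
# has the closed form exp(-x / 2); this shortcut is not valid for other dof.
p_value = math.exp(-stat / 2) if dof == 2 else None
print(stat, dof, p_value)
```

A small p-value suggests that the genotype distribution differs between cases and controls at this SNP. In a genome-wide setting, many such tests are run, so a correction for multiple testing would be needed.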
One of the first motivations for proposing this novel model, the forest of LTMs (FLTM),
is to take linkage disequilibrium (LD) into account as faithfully as possible.
Linkage disequilibrium occurs because DNA variants that lie close together on the
chromosome are rarely separated by the shuffling of chromosomes (recombination) that
takes place during sex cell formation. Such variants are therefore transmitted together
(as a haplotype) from parent to child. Such patterns are the basis of the so-called
haplotype block structure [4]: "blocks" where statistical dependences between loci are
high alternate with shorter regions characterized by low statistical dependences, the
recombination hotspots. LD is crucial for association studies, since a causal locus that
does not exactly coincide with a SNP is nevertheless expected to be flanked by SNPs that
are highly likely to show an (indirect) association with the phenotype. Besides,
exploiting these high correlations is appealing for data dimension reduction.
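
To make the notion of LD concrete, the following sketch computes the common pairwise measure r^2 between two biallelic SNPs. It assumes phased haplotypes are available, coded as pairs of 0/1 alleles; this is an assumption, since genotype data are often unphased. The function name `ld_r_squared` and the example haplotypes are hypothetical.

```python
def ld_r_squared(haplotypes):
    """r^2 between two biallelic loci from phased haplotypes [(a, b), ...], alleles 0/1."""
    n = len(haplotypes)
    if n == 0:
        raise ValueError("at least one haplotype is required")
    p_a = sum(a for a, _ in haplotypes) / n   # frequency of allele 1 at locus A
    p_b = sum(b for _, b in haplotypes) / n   # frequency of allele 1 at locus B
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b                      # disequilibrium coefficient D
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("r^2 is undefined when a locus is monomorphic")
    return d * d / denom


# Hypothetical sample: the two loci mostly travel together, with two recombinants.
haplotypes = [(1, 1)] * 4 + [(0, 0)] * 4 + [(1, 0), (0, 1)]
r2 = ld_r_squared(haplotypes)
print(r2)
```

Values of r^2 close to 1 indicate that one SNP is nearly a proxy for the other, which is the redundancy that LD-based dimension reduction seeks to exploit.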
Data dimension reduction exploiting LD is not new to genetics. However, tackling
this issue through adapted Bayesian networks has only recently been proposed [3].