Forests of Latent Tree Models to Decipher Genotype-Phenotype Associations - Biomedical Engineering Systems and Technologies


5 Evaluation

The application software, CFHLC+, is available at http://sites.google.com/site/raphaelmourad/Home/programmes. It is developed in C++ and relies on the ProBT library dedicated to Bayesian networks (http://bayesian-programming.org).

The algorithm was tested on datasets describing 10^5 SNPs for 2000 individuals. With the first version, the running time was around 15 hours for an arbitrary window size of 100 SNPs. When the sliding window size δ is set to 0.5 Mb, a reasonable choice to capture LD, the novel algorithm runs in less than 12 hours. It should be emphasized that, since the algorithm runs EM with 10 restarts, this is a significant improvement over the initial version. Finally, the algorithm is shown to scale quasi-linearly with the number of SNPs and linearly with the sliding window size. These experiments are reported in [3], together with an examination of robustness with respect to parameter adjustment.
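To make the role of the window size δ concrete, the following minimal sketch groups SNPs into windows by physical position. It is an illustration only, not the CFHLC+ implementation: the function name `windows_by_position`, the assumption that positions are sorted and given in base pairs, and the choice of contiguous, non-overlapping windows are all hypothetical.

```python
def windows_by_position(positions_bp, delta_bp=500_000):
    """Group SNP indices into consecutive windows narrower than delta_bp.

    Assumptions (not taken from the source): positions_bp is sorted in
    increasing order, and windows are contiguous and non-overlapping.
    """
    windows = []
    current = []
    start = None
    for idx, pos in enumerate(positions_bp):
        # Open a new window once the current one would reach delta_bp.
        if start is None or pos - start >= delta_bp:
            if current:
                windows.append(current)
            current = []
            start = pos
        current.append(idx)
    if current:
        windows.append(current)
    return windows


# Toy usage with delta = 0.5 Mb expressed in base pairs.
toy_positions = [0, 100_000, 499_999, 500_000, 900_000, 1_200_000]
toy_windows = windows_by_position(toy_positions, delta_bp=500_000)
```

Defining windows by physical distance rather than by a fixed SNP count lets the number of SNPs per window follow the local marker density.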

FLTM was shown to faithfully model linkage disequilibrium. Due to its hierarchical structure, the multiple layers of an FLTM are expected to describe various degrees of LD strength. To check this property, two matrices were compared for a given genomic region. First, the standard triangular matrix M_c of pairwise dependences (r^2 coefficient) between SNPs was calculated. Then, for each pair of SNPs, the latent variable representing their lowest common ancestor (LCA) was identified. In addition, the mean r^2 is easily computed over all latent variables located at the same level of the FLTM hierarchy. Thus, each cell of the second matrix, M_d, was assigned the mean r^2 measure associated with the level of the LCA. For a visual comparison, a color palette whose shade darkens with increasing dependence was assigned to M_c, whereas a discretized palette was assigned to M_d. The visual comparison of the two plots clearly showed that the FLTM faithfully reflects the variety of LD strengths (see [3]).
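The comparison of the two matrices can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: r^2 is approximated by the squared Pearson correlation of 0/1/2 genotype counts, the forest is encoded by a hypothetical `parent` dictionary (child to parent) and a `level` dictionary for latent variables, and the mean r^2 of a level is taken over the SNP pairs whose LCA lies at that level. Pairs without a common ancestor are left as NaN in M_d.

```python
import numpy as np


def pairwise_r2(genotypes):
    """M_c: squared Pearson correlation between SNP columns (assumed r^2 proxy)."""
    corr = np.corrcoef(genotypes, rowvar=False)
    return corr ** 2


def ancestors(node, parent):
    """Return the chain node, parent(node), ... up to the root of its tree."""
    chain = []
    while node is not None:
        chain.append(node)
        node = parent.get(node)
    return chain


def lca(a, b, parent):
    """Lowest common ancestor of a and b, or None if they lie in different trees."""
    seen = set(ancestors(a, parent))
    for node in ancestors(b, parent):
        if node in seen:
            return node
    return None


def lca_level_matrix(snps, parent, level):
    """Level of the LCA for every SNP pair; -1 where no LCA exists."""
    n = len(snps)
    levels = np.full((n, n), -1, dtype=int)
    for i in range(n):
        for j in range(i + 1, n):
            node = lca(snps[i], snps[j], parent)
            if node is not None:
                levels[i, j] = levels[j, i] = level[node]
    return levels


def discretized_matrix(m_c, levels):
    """M_d: each cell receives the mean r^2 of all pairs sharing its LCA level."""
    m_d = np.full(m_c.shape, np.nan)
    for lv in np.unique(levels[levels >= 0]):
        mask = levels == lv
        m_d[mask] = m_c[mask].mean()
    return m_d


# Toy forest (hypothetical): H1 -> {s0, s1}, H2 -> {s2, s3}, H3 -> {H1, H2}.
rng = np.random.default_rng(0)
genotypes = rng.integers(0, 3, size=(200, 4))  # random data, for shape only
snps = ["s0", "s1", "s2", "s3"]
parent = {"s0": "H1", "s1": "H1", "s2": "H2", "s3": "H2", "H1": "H3", "H2": "H3"}
level = {"H1": 1, "H2": 1, "H3": 2}

m_c = pairwise_r2(genotypes)
levels = lca_level_matrix(snps, parent, level)
m_d = discretized_matrix(m_c, levels)
```

Because M_d takes only as many distinct values as there are levels in the hierarchy, a discretized palette is a natural fit for plotting it next to the continuous palette used for M_c.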

In addition, FLTM was shown to provide a compact and interpretable view of LD for the geneticist. Low-level latent variables represent short-range LD and are interpreted as haplotype shared ancestry. High-level latent variables correspond to long-range LD, induced by population admixture or natural selection. The flexibility of FLTM was highlighted in [23], where short-range, long-range and chromosome-wide linkage disequilibrium were modeled and visualized.

Equally important for genetic association purposes is the dimension reduction aspect, together with its possible consequence, poor subsumption. Drastic reductions are observed as a rule (about 85%) (see [3]). However, for latent variables, the quality of the information about the child variables is expected to decrease in a bottom-up fashion. Now that the soundness of FLTM for LD modeling has been assessed, its ability to capture genetic associations remains to be demonstrated.

6 Protocol to Assess the Suitability of FLTM to Association Genetics

The objective of the study is to investigate how information about causality fades from bottom to top in the hierarchy and what the trends are regarding the ratios of latent
