of randomly selected nodes were designed to build the neighbourhood of an incumbent graph G. Whatever the method used for model estimation (downhill search or simulated annealing), a severe issue limits the efficiency of the model learning algorithm: no operational characterization of general decomposable graphs is available. Therefore, starting from the incumbent graph G, one has to check a posteriori whether a proposal G′ in the neighbourhood of G is decomposable (see the sketch below). In the BN-based approach of Lee and Shatkay, hill climbing with random restarts is constrained by the greedy sparse candidate procedure, a standard technique used to accelerate structure learning [8].
Models with latent variables have also been investigated to model and exploit LD.
Their common characteristic is the use of the SEM (structural expectation-maximization) procedure. SEM successively optimizes the parameters θ conditional on the structure S, then optimizes S conditional on θ (see the sketch after this paragraph). In this line, Greenspan and Geiger efficiently infer
ancestral haplotypes using BNs with latent variables and taking account of recombina-
tion hotspots, bottlenecks, genetic drift and mutation [9]. Their model is defined by a
partition of the region under analysis into blocks, with one or more ancestor haplotypes
defined for each block (latent variables). A Markov chain expresses the dependences
between the block genealogies, reflecting the fact that LD exists between blocks as well
as within them. First, an ensemble of locally optimal local models is inferred. For this purpose, the search space of models is explored through Gibbs-style
iterations. All model parameters are optimized at each stage of this process, using the
local search and an adapted EM algorithm. The Latent Class Model (LCM) (see In-
troduction) is the basic component of Kimmel and Shamir's model [10]. The latter is
a collection of LCMs. The modeling principle is to assign each locus to a cluster, i.e. an LCM. In this category, Zhang and Ji's approach improves over the previous block-based approach: a cluster-based approach allows non-contiguous SNPs in the same cluster [1].
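The following is a minimal, schematic sketch of the SEM alternation shared by these approaches; the three callables (e_step, fit_parameters, search_structure) are hypothetical placeholders for the model-specific E-step, parameter optimization and structure search, which differ across [9], [10] and [1].

def structural_em(data, S0, theta0, e_step, fit_parameters, search_structure, n_iters=20):
    """Schematic SEM alternation: theta conditional on S, then S conditional on theta."""
    S, theta = S0, theta0
    for _ in range(n_iters):
        # E-step: expected sufficient statistics of the latent variables
        # under the current structure S and parameters theta.
        ess = e_step(data, S, theta)
        # M-step on the parameters: optimize theta conditional on S.
        theta = fit_parameters(S, ess)
        # Structural step: optimize S conditional on theta (e.g. a local
        # search scored with the expected statistics).
        S = search_structure(ess, theta)
    return S, theta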
None of the previous approaches is scalable when more than one thousand SNPs
and one thousand individuals are considered. To cope with genome-wide data, various directions have been explored. Hidden Markov models (HMMs) are a particular class of
PGMs, in which the latent variables are ruled by a Markov chain. Daly et al. developed
an HMM for LD mapping [4]. In this line, Scheet and Stephens proposed an HMM-based model where latent states correspond to ancestral haplotypes [11] (see the sketch after this paragraph). To cap-
ture the features inherent to LD, i.e. that cluster-like patterns tend to be local in nature,
an HMM is used: it models the fact that alleles at nearby markers are likely to arise
from the same cluster. Besides, the model can deal with gradual decline of LD with dis-
tance. The corresponding well-known software program, fastPHASE, is accurate and can handle very large datasets (one thousand individuals and hundreds of thousands of SNPs). A variable-length Markov chain (VLMC) was proposed by Browning
and Browning [12]. In VLMCs, the memory length can vary along the chain, depending
on the context. Thus, the memory length will be longer for regions of high LD, whereas it will be shorter at hotspots, where recombination entails low LD. In contrast to HMMs,
no prespecification of the structure, such as the number of ancestral haplotypes, is required to learn a VLMC, and model learning is performed through a fast heuristic. The
widely used Beagle software is expected to scale to one thousand individuals and one
hundred thousand SNPs.
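As a rough illustration of this HMM view (and not of fastPHASE or Beagle themselves), the sketch below evaluates the likelihood of a single haplotype under an HMM whose latent states are ancestral clusters and whose cluster-switch probability grows with inter-marker distance; the exponential switch model, the rate parameter rho and the function names are assumptions made here for the example.

import numpy as np

def haplotype_loglik(alleles, freqs, distances, rho=1e-3):
    """Scaled forward algorithm for one haplotype under a cluster HMM.
    alleles:   (M,) array of 0/1 alleles at M ordered markers
    freqs:     (M, K) allele-1 frequency of each of K ancestral clusters
    distances: (M-1,) inter-marker distances
    rho:       assumed cluster-switch rate per unit distance."""
    M, K = freqs.shape
    # Emission probabilities P(observed allele | cluster), one row per marker.
    emit = np.where(alleles[:, None] == 1, freqs, 1.0 - freqs)
    # Uniform initial cluster distribution at the first marker.
    alpha = np.full(K, 1.0 / K) * emit[0]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    for m in range(1, M):
        # Switch probability grows with distance: LD declines between distant markers.
        switch = 1.0 - np.exp(-rho * distances[m - 1])
        # Stay in the same cluster with probability (1 - switch), otherwise
        # jump to a cluster drawn uniformly at random.
        alpha = ((1.0 - switch) * alpha + switch / K) * emit[m]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

With uninformative clusters (all frequencies equal to 0.5), the returned value reduces to M · log(0.5), which gives a quick sanity check.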