of randomly selected nodes were designed to build the neighbourhood of an incumbent graph G. Whatever the method used for model estimation (downhill search or simulated annealing), a severe issue limits the efficiency of the model learning algorithm: no operational characterization of general decomposable graphs is available. Therefore, starting from the incumbent graph G, one has to check a posteriori whether a proposal G′ in the neighbourhood of G is decomposable (see the sketch below). In the BN-based approach of Lee and Shatkay, hill climbing with random restarts is constrained by the greedy sparse candidate procedure, a standard technique used to accelerate structure learning [8].
Models with latent variables have also been investigated to model and exploit LD.
Their common characteristic is the use of the SEM (structural expectation-maximization) procedure. SEM successively optimizes the parameters θ conditional on the structure S, then optimizes S conditional on θ (see the sketch after this paragraph). In this line, Greenspan and Geiger efficiently infer
ancestral haplotypes using BNs with latent variables and taking account of recombina-
tion hotspots, bottlenecks, genetic drift and mutation [9]. Their model is defined by a
partition of the region under analysis into blocks, with one or more ancestor haplotypes
defined for each block (latent variables). A Markov chain expresses the dependences
between the block genealogies, reflecting the fact that LD exists between blocks as well
as within them. First, an ensemble of locally optimal local models is inferred. For this purpose, the search space of models is explored through Gibbs-style
iterations. All model parameters are optimized at each stage of this process, using the
local search and an adapted EM algorithm. The Latent Class Model (LCM) (see In-
troduction) is the basic component of Kimmel and Shamir's model [10]. The latter is
a collection of LCMs. The modeling principle is to assign each locus to a cluster, i.e. an LCM. In this category, Zhang and Ji's approach improves over the previous block-based approach: a cluster-based approach allows non-contiguous SNPs in the same cluster [1].
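The following is a minimal, schematic sketch of the SEM alternation shared by these approaches; the three callables (e_step, fit_parameters, search_structure) are hypothetical placeholders for the model-specific E-step, parameter optimization and structure search, which differ across [9], [10] and [1].

def structural_em(data, S0, theta0, e_step, fit_parameters, search_structure, n_iters=20):
    """Schematic SEM alternation: theta conditional on S, then S conditional on theta."""
    S, theta = S0, theta0
    for _ in range(n_iters):
        # E-step: expected sufficient statistics of the latent variables
        # under the current structure S and parameters theta.
        ess = e_step(data, S, theta)
        # M-step on the parameters: optimize theta conditional on S.
        theta = fit_parameters(S, ess)
        # Structural step: optimize S conditional on theta (e.g. a local
        # search scored with the expected statistics).
        S = search_structure(ess, theta)
    return S, theta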
None of the previous approaches is scalable when more than one thousand SNPs
and one thousand individuals are considered. To cope with genome-wide data, various directions have been explored. Hidden Markov models (HMMs) are a particular class of
PGMs, in which the latent variables are ruled by a Markov chain. Daly et al. developed
an HMM for LD mapping [4]. In this line, Scheet and Stephens proposed an HMM-based model where latent states correspond to ancestral haplotypes [11] (see the sketch after this paragraph). To cap-
ture the features inherent to LD, i.e. that cluster-like patterns tend to be local in nature,
an HMM is used: it models the fact that alleles at nearby markers are likely to arise
from the same cluster. Besides, the model can deal with gradual decline of LD with dis-
tance. The corresponding well-known software program, fastPHASE, is accurate and can handle very large datasets (one thousand individuals and hundreds of thousands of SNPs). A variable-length Markov chain (VLMC) was proposed by Browning
and Browning [12]. In VLMCs, the memory length can vary along the chain, depending
on the context. Thus, the memory length will be longer for regions of high LD, whereas it will be shorter at hotspots, where recombination entails low LD. In contrast to HMMs,
no prespecification of the structure, such as the number of ancestral haplotypes, is required to learn a VLMC, and model learning is performed through a fast heuristic. The
widely used Beagle software is expected to scale to one thousand individuals and one
hundred thousand SNPs.
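As a rough illustration of this HMM view (and not of fastPHASE or Beagle themselves), the sketch below evaluates the likelihood of a single haplotype under an HMM whose latent states are ancestral clusters and whose cluster-switch probability grows with inter-marker distance; the exponential switch model, the rate parameter rho and the function names are assumptions made here for the example.

import numpy as np

def haplotype_loglik(alleles, freqs, distances, rho=1e-3):
    """Scaled forward algorithm for one haplotype under a cluster HMM.
    alleles:   (M,) array of 0/1 alleles at M ordered markers
    freqs:     (M, K) allele-1 frequency of each of K ancestral clusters
    distances: (M-1,) inter-marker distances
    rho:       assumed cluster-switch rate per unit distance."""
    M, K = freqs.shape
    # Emission probabilities P(observed allele | cluster), one row per marker.
    emit = np.where(alleles[:, None] == 1, freqs, 1.0 - freqs)
    # Uniform initial cluster distribution at the first marker.
    alpha = np.full(K, 1.0 / K) * emit[0]
    scale = alpha.sum()
    loglik = np.log(scale)
    alpha /= scale
    for m in range(1, M):
        # Switch probability grows with distance: LD declines between distant markers.
        switch = 1.0 - np.exp(-rho * distances[m - 1])
        # Stay in the same cluster with probability (1 - switch), otherwise
        # jump to a cluster drawn uniformly at random.
        alpha = ((1.0 - switch) * alpha + switch / K) * emit[m]
        scale = alpha.sum()
        loglik += np.log(scale)
        alpha /= scale
    return loglik

With uninformative clusters (all frequencies equal to 0.5), the returned value reduces to M · log(0.5), which gives a quick sanity check.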