structure. In other words, latent variables capture the information borne by the underlying observed variables (e.g. genetic markers). In turn, these latent variables, now playing the role of observed variables, are synthesized through additional latent variables, and so on.
The FLTM learning algorithm is now depicted more formally (see Algorithm 1). The ascending hierarchical clustering (AHC) process is initiated from the first layer, consisting of univariate models. One such univariate model is built for each observed variable (lines 2 and 3). The AHC process terminates if each identified cluster is reduced to a singleton (line 7) or if no cluster of size at least 2 could be validated (line 18). At each step, an LCM is first learnt for each cluster containing at least two nodes (line 12); the validity of the proposed subsumption is then checked (lines 13 to 16). For simplification, the cardinality of the latent variable is estimated as an affine function of the number of variables in the corresponding cluster (line 11). After validation, the LCM is used to enrich the FLTM model (line 14): a node corresponding to the new latent variable L_i^k is created and connected to the child nodes; the prior distributions of the child nodes are replaced with distributions conditional on the latent variable. L_i^k is added to the set of latent variables, and its imputed values, D[L_i^k], are stored (line 15). All variables in C_i^k are then dismissed and replaced with the latent variable (line 15). In contrast, the nodes in unvalidated clusters are kept isolated for the next step.
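The per-step logic described above can be sketched as follows. This is a minimal runnable illustration, not the authors' implementation: learn_lcm and validate are toy stand-ins passed in by the caller, and the helper names, the latent-variable naming scheme, and the affine coefficients are ours.

```python
def affine_cardinality(cluster_size, a=1, b=2):
    # Cardinality of the latent variable: an affine function of the
    # cluster size (coefficients a, b are illustrative).
    return a * cluster_size + b

def fltm_step(clusters, learn_lcm, validate):
    """One AHC step: subsume each validated cluster (size >= 2) under a
    new latent variable; keep singletons and unvalidated clusters
    isolated for the next step."""
    next_layer, latents = [], []
    for cluster in clusters:
        if len(cluster) < 2:
            next_layer.extend(cluster)          # singleton: stays isolated
            continue
        lcm = learn_lcm(cluster, affine_cardinality(len(cluster)))
        if validate(lcm):
            latent = f"L_{len(latents)}"        # new latent node; its children
            latents.append(latent)              # are the cluster's variables
            next_layer.append(latent)           # cluster replaced by its latent
        else:
            next_layer.extend(cluster)          # kept isolated for the next step
    return next_layer, latents

# Toy run: every cluster of size >= 2 is validated.
layer, lat = fltm_step([["s1", "s2"], ["s3"]],
                       learn_lcm=lambda c, k: (c, k),
                       validate=lambda m: True)
print(layer, lat)  # ['L_0', 's3'] ['L_0']
```

In a full run, the returned layer would be clustered again at the next iteration, until one of the two termination conditions above is met.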
4.2 Details of the Algorithm
Five points of this algorithm are now detailed. For a start, light is shed on the two points that distinguish the initial version in [3] from the novel version, CFHLC+.
Window-Based Data Scan versus Straightforward Data Scan. First, the reader is reminded that the initial observed variables are SNPs, which are located along the genome in a sequence of "neighbouring" (but generally non-contiguous) genetic markers. To meet the scalability criterion, a divide-and-conquer procedure was implemented in [3]: the data are scanned through contiguous windows of identical fixed size. However, such splitting is questionable: it entails a bias in the processing of the variables located in the neighbourhood of the artificial window frontiers, and managing overlapping windows would not have led to a practicable algorithm. Therefore, a first notable difference from the algorithm in [3] is that the novel version does not require data splitting. Instead, a simple principle is implemented: not all pairs of variables are processed by the partitioning algorithm. Beyond a physical distance on the chromosome, δ, specified by the geneticist, variables are not allowed in the same cluster. Setting the δ constraint actually corresponds to implementing a sliding window approach.
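The δ constraint can be illustrated as a simple pair filter: only SNP pairs whose chromosomal distance is at most δ are eligible for the same cluster and thus passed on to the partitioning step. The function name and the base-pair coordinates below are illustrative, not taken from the paper.

```python
def allowed_pairs(positions, delta):
    """Yield index pairs (i, j) of SNPs whose physical distance on the
    chromosome is at most delta; only these pairs are considered by
    the partitioning algorithm."""
    pairs = []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(positions[i] - positions[j]) <= delta:
                pairs.append((i, j))
    return pairs

positions = [100, 250, 900, 1000]           # base-pair coordinates (toy data)
print(allowed_pairs(positions, delta=200))  # [(0, 1), (2, 3)]
```

Because eligibility depends only on relative distance rather than on fixed window boundaries, every variable sees the same δ-neighbourhood around itself, which is why this amounts to a sliding window.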
Partitioning of Variables into Cliques. Standard agglomerative hierarchical cluster-
ing considers a similarity matrix. As a latent variable is intended to connect pairwise
dependent variables, the standard agglomerative approach was adapted accordingly.
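The adaptation can be pictured as partitioning the variables into cliques of a pairwise-dependence graph, so that every two variables in a cluster are mutually dependent. The greedy strategy and the toy edge set below are ours, for illustration only, not the exact partitioning algorithm used in [3].

```python
def greedy_clique_partition(n, edges):
    """Partition nodes 0..n-1 into cliques of the graph whose edges are
    given as a set of frozenset pairs: each node joins the first cluster
    it is fully connected to, otherwise it starts a new cluster."""
    clusters = []
    for v in range(n):
        for c in clusters:
            if all(frozenset((v, u)) in edges for u in c):
                c.append(v)     # v is pairwise dependent with the whole cluster
                break
        else:
            clusters.append([v])
    return clusters

# Toy dependence graph: {0, 1, 2} form a triangle, {3, 4} an isolated edge.
edges = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (3, 4)]}
print(greedy_clique_partition(5, edges))  # [[0, 1, 2], [3, 4]]
```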
Within each window, the previous version runs a clique partitioning algorithm on the
complete graph of pairwise dependences. In the novel version, no complete matrix is