structure. In other words, latent variables capture the information borne by the underlying observed variables (e.g. genetic markers). In turn, these latent variables, now playing the role of observed variables, are synthesized through additional latent variables, and so on.
The FLTM learning algorithm is now depicted more formally (see Algorithm 1). The ascending hierarchical clustering (AHC) process is initiated from the first layer, consisting of univariate models. One such univariate model is built for each observed variable (lines 2 and 3). The AHC process terminates if each identified cluster is reduced to a singleton (line 7) or if no cluster of size at least 2 could be validated (line 18). At each step, an LCM is first learnt for each cluster containing at least two nodes (line 12); the validity of the proposed subsumption is then checked (lines 13 to 16). For simplification, the cardinality of the latent variable is estimated as an affine function of the number of variables in the corresponding cluster (line 11). After validation, the LCM is used to enrich the FLTM model (line 14): a node corresponding to the new latent variable L_i^k is created and connected to the child nodes; the prior distributions of the child nodes are replaced with distributions conditional on the latent variable. L_i^k is added to the set of latent variables, and its imputed values, D[L_i^k], are stored (line 15). All variables in C_i^k are then dismissed and replaced with the latent variable (line 15). In contrast, the nodes in unvalidated clusters are kept isolated for the next step.
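The per-step logic described above can be sketched as follows. This is a minimal runnable illustration, not the authors' implementation: learn_lcm and validate are toy stand-ins passed in by the caller, and the helper names, the latent-variable naming scheme, and the affine coefficients are ours.

```python
def affine_cardinality(cluster_size, a=1, b=2):
    # Cardinality of the latent variable: an affine function of the
    # cluster size (coefficients a, b are illustrative).
    return a * cluster_size + b

def fltm_step(clusters, learn_lcm, validate):
    """One AHC step: subsume each validated cluster (size >= 2) under a
    new latent variable; keep singletons and unvalidated clusters
    isolated for the next step."""
    next_layer, latents = [], []
    for cluster in clusters:
        if len(cluster) < 2:
            next_layer.extend(cluster)          # singleton: stays isolated
            continue
        lcm = learn_lcm(cluster, affine_cardinality(len(cluster)))
        if validate(lcm):
            latent = f"L_{len(latents)}"        # new latent node; its children
            latents.append(latent)              # are the cluster's variables
            next_layer.append(latent)           # cluster replaced by its latent
        else:
            next_layer.extend(cluster)          # kept isolated for the next step
    return next_layer, latents

# Toy run: every cluster of size >= 2 is validated.
layer, lat = fltm_step([["s1", "s2"], ["s3"]],
                       learn_lcm=lambda c, k: (c, k),
                       validate=lambda m: True)
print(layer, lat)  # ['L_0', 's3'] ['L_0']
```

In a full run, the returned layer would be clustered again at the next iteration, until one of the two termination conditions above is met.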
4.2 Details of the Algorithm
Five points of this algorithm are now detailed. For a start, light is shed on the two points that distinguish the initial version in [3] from the novel version, CFHLC+.
Window-Based Data Scan versus Straightforward Data Scan. First, the reader is reminded that the initial observed variables are SNPs, which are located along the genome in a sequence of "neighbouring" (but generally non-contiguous) genetic markers. To meet the scalability criterion, a divide-and-conquer procedure was implemented in [3]: the data are scanned through contiguous windows of identical fixed size. However, such splitting is questionable: it entails a bias in the processing of the variables located in the neighbourhood of the artificial window frontiers, and managing overlapping windows would not have led to a practicable algorithm. Therefore, a first notable difference from the algorithm in [3] is that the novel version does not require data splitting. Instead, a simple principle is implemented: not all pairs of variables are processed by the partitioning algorithm. Beyond a physical distance on the chromosome, δ, specified by the geneticist, variables are not allowed in the same cluster. Setting the δ constraint actually corresponds to implementing a sliding window approach.
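The δ constraint can be illustrated as a simple pair filter: only SNP pairs whose chromosomal distance is at most δ are eligible for the same cluster and thus passed on to the partitioning step. The function name and the base-pair coordinates below are illustrative, not taken from the paper.

```python
def allowed_pairs(positions, delta):
    """Yield index pairs (i, j) of SNPs whose physical distance on the
    chromosome is at most delta; only these pairs are considered by
    the partitioning algorithm."""
    pairs = []
    n = len(positions)
    for i in range(n):
        for j in range(i + 1, n):
            if abs(positions[i] - positions[j]) <= delta:
                pairs.append((i, j))
    return pairs

positions = [100, 250, 900, 1000]           # base-pair coordinates (toy data)
print(allowed_pairs(positions, delta=200))  # [(0, 1), (2, 3)]
```

Because eligibility depends only on relative distance rather than on fixed window boundaries, every variable sees the same δ-neighbourhood around itself, which is why this amounts to a sliding window.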
Partitioning of Variables into Cliques. Standard agglomerative hierarchical cluster-
ing considers a similarity matrix. As a latent variable is intended to connect pairwise
dependent variables, the standard agglomerative approach was adapted accordingly.
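The adaptation can be pictured as partitioning the variables into cliques of a pairwise-dependence graph, so that every two variables in a cluster are mutually dependent. The greedy strategy and the toy edge set below are ours, for illustration only, not the exact partitioning algorithm used in [3].

```python
def greedy_clique_partition(n, edges):
    """Partition nodes 0..n-1 into cliques of the graph whose edges are
    given as a set of frozenset pairs: each node joins the first cluster
    it is fully connected to, otherwise it starts a new cluster."""
    clusters = []
    for v in range(n):
        for c in clusters:
            if all(frozenset((v, u)) in edges for u in c):
                c.append(v)     # v is pairwise dependent with the whole cluster
                break
        else:
            clusters.append([v])
    return clusters

# Toy dependence graph: {0, 1, 2} form a triangle, {3, 4} an isolated edge.
edges = {frozenset(p) for p in [(0, 1), (0, 2), (1, 2), (3, 4)]}
print(greedy_clique_partition(5, edges))  # [[0, 1, 2], [3, 4]]
```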
Within each window, the previous version runs a clique partitioning algorithm on the
complete graph of pairwise dependences. In the novel version, no complete matrix is