clustering
on its composition. Three scenarios are possible.
First, if NLN is formed between two leaf nodes
(i.e. two residues) then P1 is calculated as the
difference of the positions of the two residues in
the protein linear sequence, and P2 as the aver-
age value of the Euclidean distances between the
coordinates of the Cα atoms of the two residues
along the simulation. Second, if NLN is the root
of a leaf node (residue) and a subtree, each of the
properties is assigned the average of two
values: (i) the Euclidean distance between the
property value in the residue and the centroid
of the subtree, and (ii) the property value in the
subtree. Third, when NLN is composed of two
subtrees, properties P1 and P2 are calculated as
the average of the property values annotating the
roots of the two subtrees. Property P3 is always
calculated as the average hydrophobicity of the
residues in the leaf nodes of the subtree of which
NLN is the root.
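The three scenarios can be sketched in Python as a bottom-up (post-order) traversal. This is a minimal illustration, not the authors' implementation: the `Node` class, the helper names and the toy data are hypothetical, P1 is taken as the absolute difference in sequence position, and in the second scenario the "centroid of the subtree" is assumed to be the mean sequence position (for P1) and the per-frame mean of the Cα coordinates (for P2) of the residues in that subtree.

```python
from dataclasses import dataclass
from math import dist
from statistics import fmean
from typing import List, Optional, Tuple

Coord = Tuple[float, float, float]

@dataclass
class Node:
    """Dendrogram node (hypothetical layout). Leaves carry residue data."""
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    position: Optional[int] = None        # index in the linear sequence
    coords: Optional[List[Coord]] = None  # C-alpha coordinates, one per frame
    hydrophobicity: Optional[float] = None
    p1: Optional[float] = None
    p2: Optional[float] = None
    p3: Optional[float] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None

def leaves(node: Node) -> List[Node]:
    """Collect the residues (leaf nodes) below a node."""
    if node.is_leaf:
        return [node]
    return leaves(node.left) + leaves(node.right)

def mean_distance(traj_a: List[Coord], traj_b: List[Coord]) -> float:
    """Average Euclidean distance between two trajectories, frame by frame."""
    return fmean(dist(a, b) for a, b in zip(traj_a, traj_b))

def centroid_trajectory(nodes: List[Node]) -> List[Coord]:
    """Per-frame mean coordinates of a set of residues (assumed centroid)."""
    frames = zip(*(n.coords for n in nodes))
    return [tuple(fmean(axis) for axis in zip(*frame)) for frame in frames]

def annotate(node: Node) -> None:
    """Annotate every non-leaf node with (P1, P2, P3), children first."""
    if node.is_leaf:
        return
    annotate(node.left)
    annotate(node.right)
    a, b = node.left, node.right
    if a.is_leaf and b.is_leaf:
        # Scenario 1: two residues.
        node.p1 = abs(a.position - b.position)
        node.p2 = mean_distance(a.coords, b.coords)
    elif a.is_leaf or b.is_leaf:
        # Scenario 2: one residue and one subtree (assumed reading).
        leaf, sub = (a, b) if a.is_leaf else (b, a)
        members = leaves(sub)
        d1 = abs(leaf.position - fmean(m.position for m in members))
        d2 = mean_distance(leaf.coords, centroid_trajectory(members))
        node.p1 = (d1 + sub.p1) / 2
        node.p2 = (d2 + sub.p2) / 2
    else:
        # Scenario 3: two subtrees.
        node.p1 = (a.p1 + b.p1) / 2
        node.p2 = (a.p2 + b.p2) / 2
    # P3: average hydrophobicity over all residues below this node.
    node.p3 = fmean(m.hydrophobicity for m in leaves(node))

# Toy data: three residues, two trajectory frames each.
r1 = Node(position=10, coords=[(0, 0, 0), (0, 0, 0)], hydrophobicity=1.0)
r2 = Node(position=14, coords=[(3, 4, 0), (0, 0, 0)], hydrophobicity=3.0)
r3 = Node(position=20, coords=[(1.5, 2.0, 5.0), (0, 0, 0)], hydrophobicity=5.0)
inner = Node(left=r1, right=r2)
root = Node(left=inner, right=r3)
annotate(root)
```

Because children are annotated before their parent, the subtree values needed in the second and third scenarios are always available when a node is processed.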
A hierarchical clustering procedure was applied to
each one of the five data sets to search for residues
exhibiting similar SASA variation profiles. Prior to
computation of the clustering solutions, each data
set was normalised using the classic zero-mean and
unit-standard deviation technique. Following the
clustering procedure, we searched for prevalent
correlations among the clusters determined for
each data set.
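The normalisation step can be illustrated with a short Python sketch. It assumes that each residue's SASA profile is rescaled independently and that the population standard deviation is used; the text does not state either choice, and the function name `zscore` is hypothetical.

```python
from statistics import fmean, pstdev

def zscore(profile):
    """Rescale one profile to zero mean and unit standard deviation."""
    mean = fmean(profile)
    std = pstdev(profile)  # population standard deviation (an assumption)
    if std == 0:
        # A constant profile carries no variation; map it to zeros.
        return [0.0 for _ in profile]
    return [(x - mean) / std for x in profile]

# Toy SASA values for a single residue over four frames.
profile = [10.0, 12.0, 14.0, 16.0]
normalised = zscore(profile)
```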
Dendrogram Construction
Clustering solutions using the agglomerative paradigm
were computed with the program vcluster,
distributed with the clustering toolkit CLUTO
(Karypis, 2003). For each data set, the similar-
ity between the solvent accessible surface area
(SASA) variation profiles of the 127 residues of
WT-TTR was assessed using the Pearson correla-
tion coefficient. Different parameters were chosen
to produce the best clustering solution. The option
crfun was selected to improve intra-cluster quality,
and the internal criterion function i2 was chosen
to maximize the similarity between each residue
and the centroid of the cluster it is assigned to.
CLUTO requires the number of desired clusters
to be defined, and this parameter was empirically
set to 10.
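To make the similarity measure concrete, the Pearson correlation between two SASA variation profiles can be sketched as follows. This illustrates the measure only, not CLUTO's internals; the function names and the toy profiles are hypothetical.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # Correlation is undefined for a constant profile; return 0 here.
        return 0.0
    return cov / sqrt(var_x * var_y)

def similarity_matrix(profiles):
    """Pairwise Pearson similarities between all residue profiles."""
    return [[pearson(p, q) for q in profiles] for p in profiles]

# Toy profiles for three residues.
profiles = [[1.0, 2.0, 3.0], [2.0, 4.0, 6.0], [3.0, 2.0, 1.0]]
sim = similarity_matrix(profiles)
```

Residues whose profiles rise and fall together obtain a similarity close to 1, while profiles that vary in opposite directions obtain a value close to -1.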
Cluster Assessment
For each data set, the annotated dendrograms
were traversed top-down and the clusters determined
based on two criteria: (i) a cluster must be
composed of at least four amino-acid residues,
and (ii) two clusters should differ by a threshold
of 0.8. The dissimilarity of two clusters is calculated
based on the Euclidean distance between
the triples annotating the roots of the subtrees that
define them.
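One possible reading of these two criteria is sketched below in Python. The traversal rule is an assumption, since the text does not spell it out: a node is split only when both children hold at least four residues and their annotation triples are at least 0.8 apart; otherwise the node itself is reported. The dictionary layout and all names are hypothetical, and the triples are assumed to be on comparable scales.

```python
from math import dist

MIN_SIZE = 4      # criterion (i): minimum residues per cluster
THRESHOLD = 0.8   # criterion (ii): minimum dissimilarity between clusters

def dissimilarity(triple_a, triple_b):
    """Euclidean distance between two (P1, P2, P3) annotation triples."""
    return dist(triple_a, triple_b)

def should_split(left, right):
    """Assumed rule: both children are large enough and far enough apart."""
    return (
        left["size"] >= MIN_SIZE
        and right["size"] >= MIN_SIZE
        and dissimilarity(left["triple"], right["triple"]) >= THRESHOLD
    )

def extract_clusters(node):
    """Top-down traversal returning the subtrees kept as clusters."""
    children = node.get("children")
    if children and should_split(*children):
        return extract_clusters(children[0]) + extract_clusters(children[1])
    return [node]

# Toy dendrogram: the root splits into A and B; A is not split further
# because its children hold fewer than four residues each.
a1 = {"name": "A1", "size": 2, "triple": (0.0, 0.0, 0.0)}
a2 = {"name": "A2", "size": 2, "triple": (5.0, 0.0, 0.0)}
a = {"name": "A", "size": 4, "triple": (0.0, 0.0, 0.0), "children": (a1, a2)}
b = {"name": "B", "size": 5, "triple": (1.0, 0.0, 0.0)}
root = {"name": "root", "size": 9, "triple": (0.5, 0.0, 0.0), "children": (a, b)}
clusters = extract_clusters(root)
```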
The number of clusters obtained differs across the
five data sets, ranging from 16 to 24,
with the number of amino-acid residues per cluster
ranging from 4 to 10.
Dendrogram Annotation
Following the construction of the dendrograms,
each non-leaf node was annotated by perform-
ing a bottom-up traversal of the tree, i.e. from
the leaves (residues) to the root. The annotation
consists of a triple with information on the follow-
ing properties: (P1) the distance of the residues
in the protein linear sequence, (P2) the spatial
distance of the residues along the MD unfolding
trajectory, and (P3) the hydrophobic character
of the residues. For each non-leaf node (NLN),
the calculation of properties P1 and P2 depends
Searching for Cluster Conservation
across Multiple Data Sets
Notwithstanding the importance of the clusters
found, correlating the SASA variation profiles
of the individual amino-acid residues in each of
the five data sets, if one was able to find groups