Using Data Mining Techniques to Probe the Role of Hydrophobic Residues in Protein Folding and Unfolding Simulations - Evolving Application Domains of Data Warehousing and Mining


clustering

Depending on the composition of a non-leaf node (NLN), three scenarios are possible. First, if the NLN is formed between two leaf nodes (i.e. two residues), then P1 is calculated as the difference between the positions of the two residues in the protein linear sequence, and P2 as the average of the Euclidean distances between the coordinates of the Cα atoms of the two residues along the simulation. Second, if the NLN is the root of a leaf node (residue) and another subtree, each property is assigned the average of two values: (i) the Euclidean distance between the property value in the residue and the centroid of the subtree, and (ii) the property value in the subtree. Third, when the NLN is composed of two subtrees, properties P1 and P2 are calculated as the average of the property values annotating the roots of the two subtrees. Property P3 is always calculated as the average value of the residues present in the leaf nodes of the subtree of which the NLN is the root.
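The three scenarios can be sketched as a bottom-up traversal. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Node` class, the `annotate` function and the `mean_dist` lookup (mean Cα distance per residue pair) are hypothetical names, and the reading of "centroid of the subtree" used here (mean sequence position for P1, mean of the residue-to-member distances for P2) is one possible interpretation of the text.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, List, Optional, Tuple


@dataclass
class Node:
    # Leaf fields (hypothetical): sequence position and hydrophobicity score.
    position: Optional[int] = None
    hydrophobicity: Optional[float] = None
    # Internal-node fields: the two children and the (P1, P2, P3) annotation.
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    triple: Optional[Tuple[float, float, float]] = None

    @property
    def is_leaf(self) -> bool:
        return self.left is None and self.right is None


def leaves(node: Node) -> List[Node]:
    """Collect the leaf nodes (residues) below a node."""
    if node.is_leaf:
        return [node]
    return leaves(node.left) + leaves(node.right)


def annotate(node: Node, mean_dist: Dict[FrozenSet[int], float]) -> None:
    """Annotate every non-leaf node with (P1, P2, P3), children first."""
    if node.is_leaf:
        return
    annotate(node.left, mean_dist)
    annotate(node.right, mean_dist)
    a, b = node.left, node.right
    if a.is_leaf and b.is_leaf:
        # Scenario 1: two residues.
        p1 = abs(a.position - b.position)
        p2 = mean_dist[frozenset((a.position, b.position))]
    elif a.is_leaf or b.is_leaf:
        # Scenario 2: one residue and one subtree (interpretation assumed).
        leaf, sub = (a, b) if a.is_leaf else (b, a)
        members = leaves(sub)
        centroid_pos = sum(m.position for m in members) / len(members)
        d1 = abs(leaf.position - centroid_pos)
        d2 = sum(mean_dist[frozenset((leaf.position, m.position))]
                 for m in members) / len(members)
        p1 = (d1 + sub.triple[0]) / 2
        p2 = (d2 + sub.triple[1]) / 2
    else:
        # Scenario 3: two subtrees, average the annotations of their roots.
        p1 = (a.triple[0] + b.triple[0]) / 2
        p2 = (a.triple[1] + b.triple[1]) / 2
    members = leaves(node)
    p3 = sum(m.hydrophobicity for m in members) / len(members)
    node.triple = (p1, p2, p3)


# Toy example with made-up positions, scores and distances.
r10 = Node(position=10, hydrophobicity=1.0)
r14 = Node(position=14, hydrophobicity=3.0)
r20 = Node(position=20, hydrophobicity=2.0)
inner = Node(left=r10, right=r14)
root = Node(left=inner, right=r20)
mean_dist = {
    frozenset((10, 14)): 6.0,
    frozenset((10, 20)): 9.0,
    frozenset((14, 20)): 7.0,
}
annotate(root, mean_dist)
```

In this toy tree the inner node falls under the first scenario and the root under the second, so both code paths are exercised.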

A hierarchical clustering procedure was applied to each one of the five data sets to search for residues exhibiting similar SASA variation profiles. Prior to computation of the clustering solutions, each data set was normalised using the classic zero-mean and unit-standard deviation technique. Following the clustering procedure, we searched for prevalent correlations among the clusters determined for each data set.
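The zero-mean, unit-standard-deviation normalisation can be illustrated as follows. This is a sketch only: the function name `zscore` is hypothetical, the text does not say whether the population or sample standard deviation was used (the population form is assumed here), and applying it to one profile at a time is likewise an assumption.

```python
from statistics import mean, pstdev


def zscore(values):
    """Rescale a sequence to zero mean and unit standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (assumed)
    if sigma == 0:
        # A constant profile carries no variation; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Made-up SASA-like values, for illustration only.
profile = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
normalised = zscore(profile)
```

After this step, profiles with very different absolute SASA ranges become comparable in shape, which is what a search for similar variation profiles needs.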

Dendrogram Construction

Clustering solutions using the agglomerative paradigm were computed with the program vcluster, distributed with the clustering toolkit CLUTO (Karypis, 2003). For each data set, the similarity between the solvent accessible surface area (SASA) variation profiles of the 127 residues of WT-TTR was assessed using the Pearson correlation coefficient. Different parameters were chosen to produce the best clustering solution. The option crfun was selected to improve intra-cluster quality, and the internal criterion function i2 was chosen to maximize the similarity between each residue and the centroid of the cluster it is assigned to. CLUTO requires the number of desired clusters to be defined, and this parameter was empirically set to 10.
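The similarity measure named above, the Pearson correlation coefficient between two profiles, can be sketched in plain Python. This illustrates the measure itself and is not CLUTO's code; the function names and the toy residue labels are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("profiles must have the same length (at least 2)")
    mx = sum(x) / n
    my = sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Correlation is undefined for a constant profile; 0.0 is a choice.
        return 0.0
    return sxy / sqrt(sxx * syy)


def similarity_matrix(profiles):
    """Pairwise Pearson similarity for a dict of label -> profile."""
    labels = sorted(profiles)
    return {(a, b): pearson(profiles[a], profiles[b])
            for a in labels for b in labels}


# Made-up profiles for three hypothetical residues.
profiles = {
    "res_a": [1.0, 2.0, 3.0, 4.0],
    "res_b": [2.0, 4.0, 6.0, 8.0],
    "res_c": [4.0, 3.0, 2.0, 1.0],
}
sim = similarity_matrix(profiles)
```

Because the coefficient depends only on the shape of the profiles, `res_a` and `res_b` are maximally similar despite their different magnitudes, while `res_c` is anti-correlated with both.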

Cluster Assessment

For each data set, the annotated dendrograms were traversed top-down and the clusters were determined based on two criteria: (i) a cluster must be composed of at least four amino-acid residues, and (ii) the dissimilarity between two clusters must reach a threshold of 0.8. The dissimilarity of two clusters is calculated as the Euclidean distance between the triples annotating the roots of the subtrees that define them.
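One way to read the two criteria is as a recursive split rule: a subtree is divided into its two children only if both children hold at least four residues and their annotation triples are at least 0.8 apart. The sketch below follows that reading. The `Subtree` class and `extract_clusters` function are hypothetical, the handling of undersized subtrees is an assumption, and the triples are assumed to be on comparable scales.

```python
from dataclasses import dataclass
from math import dist
from typing import List, Optional, Tuple


@dataclass
class Subtree:
    residues: List[int]                                   # residues below this node
    triple: Optional[Tuple[float, float, float]] = None   # (P1, P2, P3) annotation
    children: Tuple["Subtree", ...] = ()                  # empty for a leaf


def extract_clusters(node, min_size=4, threshold=0.8):
    """Top-down traversal returning the residue lists accepted as clusters."""
    if len(node.residues) < min_size:
        return []  # too small to form a cluster (assumed to be discarded)
    if len(node.children) == 2:
        left, right = node.children
        both_large = (len(left.residues) >= min_size
                      and len(right.residues) >= min_size)
        # Split only when both halves are large enough and dissimilar enough.
        if both_large and dist(left.triple, right.triple) >= threshold:
            return (extract_clusters(left, min_size, threshold)
                    + extract_clusters(right, min_size, threshold))
    return [node.residues]


# Toy dendrogram with made-up annotations.
left = Subtree(residues=[1, 2, 3, 4], triple=(0.1, 0.2, 0.3))
right = Subtree(residues=[5, 6, 7, 8, 9], triple=(0.9, 0.8, 0.3))
root = Subtree(residues=list(range(1, 10)), triple=(0.5, 0.5, 0.3),
               children=(left, right))
clusters = extract_clusters(root)
```

Here the two children are a distance of 1.0 apart, so the root is split into two clusters; with a larger threshold the nine residues would stay together as one cluster.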

The number of clusters obtained differs across the five data sets, ranging from 16 to 24, with the number of amino-acid residues per cluster spanning from 4 to 10.

Dendrogram Annotation

Following the construction of the dendrograms, each non-leaf node was annotated by performing a bottom-up traversal of the tree, i.e. from the leaves (residues) to the root. The annotation consists of a triple with information on the following properties: (P1) the distance of the residues in the protein linear sequence, (P2) the spatial distance of the residues along the MD unfolding trajectory, and (P3) the hydrophobic character of the residues. For each non-leaf node (NLN), the calculation of properties P1 and P2 depends on its composition.

Searching for Cluster Conservation across Multiple Data Sets

Notwithstanding the importance of the clusters found, correlating the SASA variation profiles of the individual amino-acid residues in each of the five data sets, if one was able to find groups
