ues until every object has been placed in a tree,
also called a dendrogram. Once the algorithm has
been selected, another critical issue in cluster
analysis is the choice of the “best” metric for
assessing the degree of similarity between the
individual objects being clustered. The cosine
function, Pearson's correlation, Jaccard's
coefficient and Euclidean distance are some of the
commonly used metrics.
The Euclidean distance and correlation measures
have a clear biological meaning: Euclidean
distances are applied when the goal is to look
for identical patterns, while correlation measures
are used when the trends of the patterns are the
subject of the analysis (Eisen, 1998; Heyer, 1999).
Cosine and correlation-based measures are well
suited for clustering both low-dimensional and
high-dimensional datasets (Zhao, 2005) and are
independent of the data scale, which is not the
case for the Euclidean distance.
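The scale independence mentioned above can be illustrated with a minimal sketch. The two profiles below are hypothetical values chosen only for illustration: the second has the same trend as the first at ten times the magnitude. The function names are likewise assumptions of this example.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def cosine(x, y):
    """Cosine of the angle between two profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def euclidean(x, y):
    """Euclidean distance between two profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical profiles: identical trend, ten times the magnitude.
profile = [1.0, 2.0, 4.0, 3.0]
scaled = [10.0 * v for v in profile]

r = pearson(profile, scaled)    # close to 1: the trend is unchanged
c = cosine(profile, scaled)     # close to 1: the direction is unchanged
d = euclidean(profile, scaled)  # large: the magnitudes differ
print(r, c, d)
```

Under the correlation and cosine measures the two profiles are treated as essentially the same pattern, whereas the Euclidean distance separates them because it also reflects the difference in magnitude.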
To group residues showing similar SASA profiles
along an MD unfolding simulation, a hierarchical
tree (dendrogram) was built reflecting how the
residues of the protein cluster together, using
Pearson's correlation coefficient as the similarity
measure. The hierarchical clustering procedure
identifies sets of correlated residues with similar
solvent exposure profiles in each of several MD
unfolding pathways, but yields the cluster
information in a tree-like structure. This makes it
very difficult to identify the “correct” number of
clusters in the hierarchical clustering solution.
We devised a method to help the researcher identify
clusters with well-differentiated characteristics
(Ferreira, 2007). Additional
information on the amino-acid residues is used to
annotate all the nodes of the dendrogram, and the
clusters are determined by taking into account
not only the SASA pattern similarity but also the
minimization of intra-cluster variation and the
maximization of inter-cluster variation, based on
the data enriching the dendrogram. The information
used describes the chemical characteristics of the
amino-acid residues and their behaviour along the
MD unfolding trajectories, and consists of the
following properties: (P1) the distance between the
residues in the protein linear sequence; (P2) the
spatial distance between the residues along the MD
unfolding trajectory, which quantifies the overall
spatial variation of the cluster and measures the
deviation of each residue in relation to a central
point of the cluster; and (P3) the hydrophobic
character of the residue (Radzicka & Wolfenden,
1988). The method can be summarized in four major
steps:
1. A dendrogram is constructed from the SASA
variation profiles of the 127 residues through
agglomerative hierarchical clustering, using
Pearson's correlation coefficient as the
similarity measure.
2. Each node of the dendrogram is annotated with
data on properties P1, P2 and P3, related to the
chemical characteristics and behaviour of the
amino-acid residues along the simulation, by
performing a bottom-up traversal of the dendrogram
(the information of a parent node is calculated
from the values of its child nodes, reflecting the
variability of the properties of the residues
constituting the cluster).
3. The annotated dendrogram is traversed top-down:
a cluster is split in two when significant
inter-cluster variation is detected (above a
user-defined threshold). The procedure is applied
recursively to the resulting clusters, unless the
number of residues in a cluster reaches a
user-defined minimum threshold.
4. The clusters and the information on their
properties are retrieved.
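The four steps can be sketched as follows. This is an illustrative simplification and not the implementation of Ferreira (2007). It rests on several assumptions: average linkage is used (the text does not name the linkage); each node is annotated with a single value, the mean hydrophobicity, standing in for P1, P2 and P3; and inter-cluster variation is taken as the difference between the annotations of the two child nodes. The six profiles, the hydrophobicity values, the thresholds and all function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def build_dendrogram(profiles):
    """Step 1: agglomerative clustering on the distance 1 - r.
    Average linkage is an assumption of this sketch."""
    n = len(profiles)
    dist = [[1.0 - pearson(profiles[i], profiles[j]) for j in range(n)]
            for i in range(n)]
    nodes = [{"members": [i], "children": None} for i in range(n)]
    while len(nodes) > 1:
        best = None
        for a in range(len(nodes)):
            for b in range(a + 1, len(nodes)):
                pairs = [(i, j) for i in nodes[a]["members"]
                         for j in nodes[b]["members"]]
                d = sum(dist[i][j] for i, j in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = {"members": nodes[a]["members"] + nodes[b]["members"],
                  "children": (nodes[a], nodes[b])}
        nodes = [nd for k, nd in enumerate(nodes) if k not in (a, b)]
        nodes.append(merged)
    return nodes[0]

def annotate(node, values):
    """Step 2: bottom-up traversal; the parent's annotation is computed
    from its children (here, a size-weighted mean)."""
    if node["children"] is None:
        node["mean"] = values[node["members"][0]]
        return
    left, right = node["children"]
    annotate(left, values)
    annotate(right, values)
    n_left, n_right = len(left["members"]), len(right["members"])
    node["mean"] = (n_left * left["mean"] + n_right * right["mean"]) / (n_left + n_right)

def cut(node, threshold, min_size):
    """Steps 3 and 4: top-down traversal; split a cluster while the
    children differ by more than `threshold`, stopping at `min_size`."""
    if node["children"] is None or len(node["members"]) <= min_size:
        return [(sorted(node["members"]), node["mean"])]
    left, right = node["children"]
    if abs(left["mean"] - right["mean"]) <= threshold:
        return [(sorted(node["members"]), node["mean"])]
    return cut(left, threshold, min_size) + cut(right, threshold, min_size)

# Hypothetical SASA-like profiles: residues 0-2 rise, residues 3-5 fall.
profiles = [
    [1, 2, 3, 4, 5], [2, 4, 6, 8, 11], [1, 3, 4, 6, 7],
    [5, 4, 3, 2, 1], [9, 8, 6, 4, 2], [6, 5, 5, 3, 1],
]
# Hypothetical hydrophobicity values, one per residue.
hydrophobicity = [1.0, 1.2, 0.9, -1.0, -1.1, -0.8]

root = build_dendrogram(profiles)
annotate(root, hydrophobicity)
clusters = cut(root, threshold=1.0, min_size=2)
for members, mean_value in clusters:
    print(members, round(mean_value, 3))
```

In this toy case the root is split because its two children differ markedly in the annotated property, while the two resulting clusters are kept whole because their own children do not. A fuller version would carry one annotation per property and combine them in the splitting criterion.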
Association rules
Association rule mining finds interesting
associations and/or correlations among large sets
of data items (Agrawal and Srikant, 1994).
Association rules show attribute/value conditions
that frequently occur together in a given data set.
They have simple, clear semantics and are of the
form: