Using Data Mining Techniques to Probe the Role of Hydrophobic Residues in Protein Folding and Unfolding Simulations - Evolving Application Domains of Data Warehousing and Mining


ues until every object has been placed in a tree, also called a dendrogram. After the selection of the algorithm, another critical issue in cluster analysis is the choice of the “best” metric for assessing the degree of similarity between the individual objects being clustered. The cosine function, Pearson's correlation, Jaccard's coefficient and the Euclidean distance are some of the commonly used metrics. The Euclidean distance and correlation measures have a clear biological meaning: Euclidean distances are applied when the goal is to look for identical patterns, while correlation measures are used when the trends of the patterns are the subject of the analysis (Eisen, 1998; Heyer, 1999). Cosine and correlation based measures are well suited for clustering both low and high dimensional datasets (Zhao, 2005) and are independent of the data scale, which is not the case for the Euclidean distance.
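The scale dependence mentioned above can be illustrated with a minimal sketch. The two profiles below are invented toy values, not data from the study, and the function names are chosen only for this illustration. The second profile follows the same trend as the first at ten times the magnitude, so the correlation and cosine measures treat the two as equivalent while the Euclidean distance does not.

```python
import math

def euclidean(x, y):
    # Straight-line distance between two equally long profiles.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    # Cosine of the angle between the two profiles seen as vectors.
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def pearson(x, y):
    # Pearson's r is the cosine similarity of the mean-centred profiles.
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    return cosine_similarity([a - mean_x for a in x],
                             [b - mean_y for b in y])

# Toy profiles (assumed values): same trend, different scale.
profile = [1.0, 2.0, 4.0, 3.0]
scaled = [10.0 * v for v in profile]

print("Euclidean distance:", euclidean(profile, scaled))
print("Cosine similarity: ", cosine_similarity(profile, scaled))
print("Pearson's r:       ", pearson(profile, scaled))
```

Both similarity values come out at (numerically) one, whereas the Euclidean distance grows with the scaling factor.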

In the interest of grouping residues showing similar SASA profiles along an MD unfolding simulation, a hierarchical tree (dendrogram) was built reflecting how the residues of the protein cluster together, using Pearson's correlation coefficient as the similarity measure. The hierarchical clustering procedure identifies sets of correlated residues with similar solvent exposure profiles in each of several MD unfolding pathways, but yields the cluster information in a tree-like structure. This makes the identification of the “correct” number of clusters in the hierarchical clustering solution very difficult. We devised a method to help the researcher identify the clusters with well differentiated characteristics (Ferreira, 2007). Additional information on the amino-acid residues is used to annotate all the nodes of the dendrogram, and the clusters are determined taking into consideration not only the SASA pattern similarity, but also the minimization of intra-cluster variation and the maximization of inter-cluster variation based on the data enriching the dendrogram. The information used describes the chemical characteristics of the amino-acid residues and their behaviour along the MD unfolding trajectories, and consists of the following properties: (P1) the distance between the residues in the protein linear sequence; (P2) the spatial distance between the residues along the MD unfolding trajectory, which quantifies the overall spatial variation of the cluster and measures the deviation of each residue in relation to a central point of the cluster; and (P3) the hydrophobic character of the residue (Radzicka & Wolfenden, 1988). The method can be summarized in four major steps:

1. A dendrogram is constructed based on the SASA variation profiles of the 127 residues, through agglomerative hierarchical clustering and using Pearson's correlation coefficient as the similarity measure.

2. Each node of the dendrogram is annotated with data on properties P1, P2 and P3, related to the chemical characteristics and behaviour of the amino-acid residues along the simulation, by performing a bottom-up traversal of the dendrogram (the information of a parent node is calculated based on the values of the child nodes, reflecting the variability of the properties of the residues constituting the cluster).

3. The annotated dendrogram is traversed in a top-down manner: a cluster is split in two when significant inter-cluster variation is detected (above a user-defined threshold). The procedure is applied recursively to the resulting clusters, unless the number of residues in a cluster reaches a user-defined minimum threshold.

4. The clusters and the information on their properties are retrieved.
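The four steps can be sketched as follows. This is a simplified illustration under stated assumptions, not the published method (Ferreira, 2007): it uses four invented toy profiles instead of the 127 residues, average linkage (the linkage criterion is not specified in the text), a single hypothetical numeric property standing in for P1 to P3, and the difference between the child means as a stand-in for the inter-cluster variation test. All function and parameter names are chosen for this example.

```python
import math
from itertools import combinations

def pearson(x, y):
    # Pearson's r; undefined (division by zero) for a constant profile.
    mx, my = sum(x) / len(x), sum(y) / len(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def build_dendrogram(profiles):
    # Step 1: agglomerative clustering on the distance 1 - r.
    # Average linkage is an assumption of this sketch.
    nodes = [{"members": [r], "children": None} for r in profiles]

    def dist(a, b):
        pairs = [(i, j) for i in a["members"] for j in b["members"]]
        return sum(1.0 - pearson(profiles[i], profiles[j])
                   for i, j in pairs) / len(pairs)

    while len(nodes) > 1:
        a, b = min(combinations(nodes, 2), key=lambda p: dist(*p))
        nodes = [n for n in nodes if n is not a and n is not b]
        nodes.append({"members": a["members"] + b["members"],
                      "children": (a, b)})
    return nodes[0]

def annotate(node, prop):
    # Step 2: bottom-up traversal; a parent is summarised from its children.
    if node["children"] is None:
        value = prop[node["members"][0]]
        node.update(n=1, mean=value, lo=value, hi=value)
        return
    left, right = node["children"]
    annotate(left, prop)
    annotate(right, prop)
    n = left["n"] + right["n"]
    node.update(n=n,
                mean=(left["n"] * left["mean"] + right["n"] * right["mean"]) / n,
                lo=min(left["lo"], right["lo"]),
                hi=max(left["hi"], right["hi"]))

def cut(node, split_threshold, min_size):
    # Step 3: top-down traversal; split while the children differ enough
    # and the cluster is still larger than the minimum size.
    if node["children"] is None or node["n"] <= min_size:
        return [node]
    left, right = node["children"]
    if abs(left["mean"] - right["mean"]) <= split_threshold:
        return [node]
    return (cut(left, split_threshold, min_size) +
            cut(right, split_threshold, min_size))

# Toy SASA-like profiles and property values (all invented).
profiles = {1: [10.0, 20.0, 30.0, 40.0], 2: [12.0, 19.0, 33.0, 41.0],
            3: [40.0, 30.0, 20.0, 10.0], 4: [42.0, 29.0, 22.0, 9.0]}
hydrophobicity = {1: 2.0, 2: 1.8, 3: -1.0, 4: -1.2}

root = build_dendrogram(profiles)
annotate(root, hydrophobicity)
clusters = cut(root, split_threshold=1.0, min_size=2)

# Step 4: retrieve the clusters with their annotated properties.
for c in clusters:
    print(sorted(c["members"]), "mean:", round(c["mean"], 2),
          "range:", (c["lo"], c["hi"]))
```

With these toy values, the residues with rising profiles and the residues with falling profiles end up in separate clusters. For real trajectories, a library routine for hierarchical clustering would be a more practical choice than the quadratic search used here.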

Association rules

Association rule mining finds interesting associations and/or correlations among large sets of data items (Agrawal and Srikant, 1994). Association rules show attribute/value conditions that occur frequently together in a given data set. They have simple and clear semantics and are of the form:
