Using Data Mining Techniques to Probe the Role of Hydrophobic Residues in Protein Folding and Unfolding Simulations - Evolving Application Domains of Data Warehousing and Mining


$C \leftarrow A_1 \;\&\; A_2 \;\&\; \dots \;\&\; A_n$

Although several types of relationships between amino-acid residues could have been studied, in this work we focused on the hydrophobic residues. The hydrophobic effect is considered to be one of the major driving forces in protein folding (Dill, 1990; Kyte, 2003; Lins & Brasseur, 1995; Pace, 1996). It arises from entropically unfavourable arrangements where non-polar side chains contact water, thus favouring polypeptide arrangements in which the side chains of hydrophobic amino-acids are packed in the interior of the protein. In fact, about 80% of the hydrophobic residues' side chains are buried inside a protein when it folds (Pace, 1996). Thus, hydrophobic residues usually exhibit small values of solvent exposure (below 25%) in the protein's folded state. We set out to find groups of residues, in particular hydrophobic ones, that change solvent exposure in a coordinated fashion during one unfolding simulation or across several unfolding simulations; such groups might be important in defining folding nuclei for a protein (Brito, 2004; Hammarström & Carlsson, 2000). For each data set, association rules were extracted such that only hydrophobic residues with SASA values ≤ 25% were involved. Because interactions between hydrophobic groups are weak, we required that association rules involve a minimum of four residues. Association rules were extracted with a minimum support of 30% and a minimum confidence of 90%.
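
The selection criteria above can be sketched in Python. This is only an illustration under stated assumptions, not CAREN's interface: the `Item` and `Rule` tuples, the `keep_rule` function, and the contents of `HYDROPHOBIC` are hypothetical names introduced here, and the example assumes that the minimum of four residues is counted over antecedent and consequent together. The thresholds are the ones given in the text.

```python
from typing import NamedTuple

# Illustrative hydrophobic set (assumption: the chapter's exact class
# assignment is not reproduced in this excerpt).
HYDROPHOBIC = {"ALA", "VAL", "LEU", "ILE", "MET"}

class Item(NamedTuple):
    residue: str     # three-letter residue name, e.g. "LEU"
    position: int    # residue number in the sequence
    sasa: float      # solvent exposure, in percent

class Rule(NamedTuple):
    antecedent: tuple   # tuple of Item
    consequent: Item    # single-item consequent
    support: float      # fraction between 0 and 1
    confidence: float   # fraction between 0 and 1

def keep_rule(rule, max_sasa=25.0, min_residues=4,
              min_support=0.30, min_confidence=0.90):
    """Return True if the rule meets all selection criteria from the text."""
    items = rule.antecedent + (rule.consequent,)
    return (
        len(items) >= min_residues
        and all(i.residue in HYDROPHOBIC and i.sasa <= max_sasa for i in items)
        and rule.support >= min_support
        and rule.confidence >= min_confidence
    )
```

A rule with three buried hydrophobic antecedents and one buried hydrophobic consequent passes only if its support and confidence also reach 30% and 90%.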

An association rule is a pair of disjoint itemsets (sets of items): the antecedent ($A_1, A_2, \dots, A_n$) and the consequent ($C$). In general, the consequent may be a set of items, but here we only consider rules with single-item consequents. In the specific problem of SASA data analysis, an item is represented by the pair residue/SASA. Each association rule is associated with two values expressing its degree of uncertainty. The first value is called the support of the rule and represents the frequency of co-occurrence of all items appearing in the rule. The second value is the confidence of the rule, which represents its accuracy. Confidence is calculated as the ratio between the support of the rule and the support of the antecedent.
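
The two measures just defined can be computed directly from a set of transactions. The sketch below is a minimal illustration, assuming that each transaction corresponds to one simulation frame and that items are strings of the form residue/SASA bin; both the toy data and the function names are hypothetical and not taken from the chapter.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Support of the whole rule divided by support of the antecedent."""
    ante = frozenset(antecedent)
    s_ante = support(ante, transactions)
    if s_ante == 0:
        return 0.0  # antecedent never occurs, so the rule is never applicable
    return support(ante | {consequent}, transactions) / s_ante

# Toy data (hypothetical): one frozenset of residue/SASA items per frame.
frames = [
    frozenset({"L12/low", "V14/low", "I26/low"}),
    frozenset({"L12/low", "V14/low", "I26/low"}),
    frozenset({"L12/low", "V14/low", "I26/high"}),
    frozenset({"L12/high", "V14/low", "I26/low"}),
]

rule_support = support({"L12/low", "V14/low", "I26/low"}, frames)
rule_confidence = confidence({"L12/low", "V14/low"}, "I26/low", frames)
```

In this toy data the rule I26/low ← L12/low & V14/low holds in two of four frames, while its antecedent holds in three, so the support is 0.5 and the confidence is 2/3.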

Finding relations between amino-acid residues belonging to the same and/or different chemical classes is of great interest in the understanding of the protein folding problem. In the present work, the amino-acids were divided into five different classes (hydrophobic, hydrophilic, polar with positive charge, polar with negative charge and aromatic), and association rules were extracted among the five classes to study relationships linked to the main forces driving the folding process: (i) association rules among hydrophobic residues, (ii) association rules among hydrophilic and hydrophobic residues, (iii) association rules among aromatic residues, and (iv) association rules among polar charged residues. Rules were extracted using CAREN (Azevedo, 2003). CAREN is a Java-based implementation of an association rule engine that uses a new variant of the ECLAT algorithm (Zaki, 2000). Several features for rule derivation and selection are available in CAREN, namely antecedent and consequent filtering by item or attribute specification, minimum and maximum number of items in a rule, and different metrics. The χ² test is one such metric. It was applied during itemset mining, as it significantly reduces the number of relevant itemsets.
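
For readers unfamiliar with ECLAT, the basic idea (Zaki, 2000) is to store, for each item, the set of transaction identifiers in which it occurs, and to obtain the support of a larger itemset by intersecting those sets. The following Python sketch shows only this basic vertical search. It is not the variant implemented in CAREN, it omits the χ² filter, and the function names `eclat` and `to_vertical` are introduced here for illustration.

```python
def to_vertical(transactions):
    """Map each item to the set of transaction ids that contain it."""
    vertical = {}
    for tid, items in enumerate(transactions):
        for item in items:
            vertical.setdefault(item, set()).add(tid)
    return sorted(vertical.items())  # fixed item order for the search

def eclat(tidsets, min_count, prefix=(), out=None):
    """Depth-first frequent-itemset search over vertical tidsets.

    tidsets: list of (item, set_of_transaction_ids) pairs.
    Returns a dict mapping each frequent itemset (a tuple) to its count.
    """
    if out is None:
        out = {}
    for i, (item, tids) in enumerate(tidsets):
        if len(tids) < min_count:
            continue  # infrequent: no superset can be frequent either
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Extend only with later items; the tidset of each extension is
        # the intersection of the two tidsets.
        suffix = [(other, tids & otids) for other, otids in tidsets[i + 1:]]
        eclat(suffix, min_count, itemset, out)
    return out

# Toy transactions (hypothetical items).
toy = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
frequent = eclat(to_vertical(toy), min_count=2)
```

With a minimum count of 2, the toy data yields the single items a, b and c together with the pairs (a, b) and (a, c); the pair (b, c) and the triple occur only once and are pruned.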

RESULTS

Here, we report and compare the results obtained by applying two data mining techniques, hierarchical clustering and association rules, to the analysis of solvent accessible surface area (SASA) variation profiles of individual amino-acid residues of the protein transthyretin across five molecular dynamics unfolding simulations.
