Using Data Mining Techniques to Probe the Role of Hydrophobic Residues in Protein Folding and Unfolding Simulations - Evolving Application Domains of Data Warehousing and Mining

Introduction

Molecular dynamics (MD) is one of the most realistic simulation techniques available to study protein folding in silico. In MD simulations, the structural fluctuations of a single protein can be tracked over time by numerically solving Newton's equations of motion (Adcock & McCammon, 2006). When using MD simulations to study protein folding and unfolding processes, multiple simulations need to be considered to probe the large conformational space and multidimensional potential energy surface available to the polypeptide chain, and to obtain significant statistical mechanical averages of the system properties (Brito, 2004; Kazmirski, 1999; Scheraga, 2007). Even though the computational power available keeps increasing, it is still a major challenge to simulate protein folding or unfolding processes on their real time scale (hundreds of µs to seconds or more). However, it has been suggested that performing multiple short simulations (usually 5 to 10) provides better sampling of the conformational space than having a single long simulation (Caves, 1998). Thus, performing multiple simulations on the 10 to 100 ns time scale is becoming routine, which generates huge amounts of data to be analysed and compared. Furthermore, a large set of structural and physical properties (such as root mean square deviation, radius of gyration, secondary structure content, native contacts, and solvent accessible surface area) is usually calculated from the MD trajectories to characterize the conformational space explored.

Most of the structural and physical properties calculated from the MD trajectories are easy to extract. However, the next challenge for data analysis in multiple MD simulations is to identify, among the properties, those that are essential in describing the protein unfolding or folding processes. Additionally, it is important to define the relative importance of each property along the folding/unfolding pathway. It is expected that some of the properties best describe the initial stages of the processes under study, while others may be more sensitive to later stages. Looking at a wide range of properties and experimental conditions further increases the amount of data generated by such simulation models. Analyzing and interpreting these data requires automated methods such as data mining. These issues have been addressed before by Kazmirski et al. (1999) and Brito et al. (2004). While Kazmirski et al. (1999) presented several methods based on structure and property data to compare different MD trajectories, Brito et al. (2004) discussed the usefulness of data mining techniques, which include machine learning, artificial intelligence, and visualization, to address the data analysis problem arising from multiple computational simulations, including protein folding and unfolding simulations. Figure 1 depicts a general overview of this process, from the initial system under study to the interpretation of the results using data mining tools. The researcher begins by performing multiple MD simulations, starting from the same experimental structure (same atom coordinates) but different initial atom velocities. For each simulation, a set of varying atom coordinates and velocities over time (a trajectory) is obtained. At the end of each simulation, a collection of molecular properties may be calculated to characterize the structural variation of the protein during the process. Finally, the molecular property variation profiles may be subjected to analysis using data mining tools.

The solvent accessible surface area (SASA) is one of the molecular properties that might be calculated for each MD trajectory. SASA reports on parameters that are important from the point of view of protein conformational stability: solvent exposure and protein compactness. Its value may be calculated for the entire protein, but also for subsets of amino-acid residues, accounting, for example, for the polar or non-polar contributions. Furthermore, the study of the SASA variation of each individual amino-acid residue provides a greater level of detail on the individual contributions to the folding or unfolding processes.
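As a minimal sketch of how the solvent accessible surface area (SASA) of subsets of residues might be followed along a trajectory, the Python fragment below sums per-residue SASA values into non-polar and polar contributions for each frame. The function name `sasa_contributions`, the `NONPOLAR` residue set, and all numbers are illustrative assumptions and are not taken from the chapter. In practice the per-residue values would come from a SASA program applied to every MD frame, and the choice of which residues count as non-polar is a convention that may differ between studies.

```python
# Assumed classification: which residues count as non-polar is a convention,
# not something fixed by the chapter.
NONPOLAR = frozenset({"ALA", "VAL", "LEU", "ILE", "MET", "PHE", "TRP", "PRO"})


def sasa_contributions(residue_names, sasa_frames):
    """Split per-residue SASA into non-polar, polar, and total sums per frame.

    residue_names: three-letter residue names, one per residue.
    sasa_frames:   list of frames; each frame holds one SASA value per
                   residue, in the same order as residue_names.
    """
    profiles = []
    for frame in sasa_frames:
        if len(frame) != len(residue_names):
            raise ValueError("each frame needs one SASA value per residue")
        nonpolar = sum(
            value
            for name, value in zip(residue_names, frame)
            if name in NONPOLAR
        )
        total = sum(frame)
        profiles.append(
            {"nonpolar": nonpolar, "polar": total - nonpolar, "total": total}
        )
    return profiles


# Made-up values for illustration only (not simulation results).
residues = ["LEU", "SER", "VAL", "LYS"]
frames = [
    [10.0, 40.0, 5.0, 80.0],   # hypothetical compact, native-like frame
    [60.0, 45.0, 50.0, 85.0],  # hypothetical more exposed frame
]
profiles = sasa_contributions(residues, frames)
for index, profile in enumerate(profiles):
    print(index, profile)
```

Each entry of `profiles` is one point of a property variation profile. A series of such points per simulation is the kind of input that could then be passed to a data mining tool, and the same grouping could be narrowed to a single residue when more detail is wanted.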
