Fast RNA Secondary Structure Prediction Using Fuzzy Stochastic Models - Biomedical Engineering Systems and Technologies

Biomedical Engineering Reference

In-Depth Information

( n 2 ) time and space requirements. This complexity improvement

is basically realized by employing an appropriate heuristic instead of the corresponding

exact algorithm for preprocessing the input sequence, i.e. for deriving a “noisy” dis-

tribution (induced by heuristic approximations of the corresponding inside and outside

probabilities) on the entire structure ensemble for the input sequence. From this distri-

bution candidate structures can be efficiently sampled 1 . Moreover, we will consider two

different sampling strategies: (a slight modification of) the widely known sampling pro-

cedure from [1,4] which basically generates a random structure from outside to inside,

and a novel alternative strategy that obeys to contrary principles and employs a reverse

course of action (from inside to outside) and manages to take more advantage of the

approximative preprocessing.

As we will see, even building on our new heuristic preprocessing step, both

sampling strategies can be applied to obtain MP structure predictions of respectable

accuracy. In principle, for sufficiently large sample sizes we obtain a similar high pre-

dictive accuracy as in the case of exact calculations 2 . The seemingly sole pitfall is

that due to the noisy ensemble distribution resulting from approximative computa-

tions, the produced samples are no longer guaranteed to primarily contain rather likely

structures (with respect to the exact distribution of feasible foldings for a given input

sequence), such that we usually have to generate more candidate structures (i.e., con-

sider larger sample sizes) in oder to ensure reproducible structure predictions. However,

this is quite unproblematic in practice: firstly, we can generate the candidate structures

in-place (only the so far most probable structure(s) need(s) to be stored), such that large

sample sizes give no rise to increased memory consumption and secondly, generating

samples can easily be parallelized on modern multi-core architectures or grids.

The rest of this paper is organized as follows: Sect. 2 briefly recaps the principles of

probabilistic statistical sampling and provides the needed formal framework. Sect. 2.3

contains a (short and exemplary) analysis on how different types and levels of distur-

bances in the underlying ensemble distribution affect the resulting sampling quality.

This actually yields an impression on the required precision of an adequate heuristic

approximation scheme for inside-outside calculations. Section 3 describes all facts con-

cerning the approximative preprocessing step that needs to be applied for decreasing

the worst-case time requirements. A (slightly modified) common sampling strategy and

an alternative novel strategy (intended to match well with our heuristic method) are

introduced and opposed in Sect. 4. In Sect. 5, the overall quality of generated sample

sets and their applicability to RNA structure prediction are investigated. We present ex-

periments which show how the prediction accuracy grows with the sample size together

with considerations on how an efficient implementation can deal with large sample sets.

Finally, Sect. 6 concludes the paper.

O

SCFGs) with only

1

With purposive proof-of-concept implementations (in Wolfram Mathematica 7.0), for instance

the overall preprocessing time for E.coli tRNA

Ala (of length n =76 ) could be reduced from

49 . 0 (traditional cubic algorithm) to only 3 . 7 (new quadratic strategy) seconds.

2

Ala , we for instance observed the same sensitivity and PPV values of 1 . 0 and

0 . 91 , respectively, with a particular application of our heuristic method and the corresponding

exact variant.

For E.coli tRNA

Biomedical Engineering Systems and Technologies

Search WWH ::

Custom Search

Home