Fast RNA Secondary Structure Prediction Using Fuzzy Stochastic Models - Biomedical Engineering Systems and Technologies

Biomedical Engineering Reference

In-Depth Information

structures for the given sequence. For example, the popular Sfold software [1,2] em-

ploys a sampling extension of the partition function (PF) approach [3] to produce sta-

tistically representative subsets of the Boltzmann-weighted ensemble. More recently,

a corresponding probabilistic method has been studied [4] which actually samples the

possible foldings from a distribution implied by a sophisticated stochastic context-free

grammar (SCFG).

Notably, both sampling methods imply the same worst-case time and space require-

ments of

( n 2 ) , respectively, for generating a fix-size sample of random

secondary structures for a given input sequence of length n and can easily be extended

to structure prediction. In fact, a corresponding prediction can be derived from any rep-

resentative statistical sample without increasing the overall time or space complexity.

Thus, the worst-case time and storage requirements for computing structure predictions

via statistical sampling are equal to those of modern state-of-the-art tools like for in-

stance the commonly used minimum free energy (MFE) based Mfold [5,6] and Vienna

RNA [7,8] packages or the popular SCFG based Pfold software [9,10].

Furthermore, applications to structure prediction showed that neither of the two com-

peting sampling approaches (SCFG and PF based method) generally outperforms the

other and consequently, it is not obvious which one should rather be preferred in prac-

tice. This somehow contradicts the fact that the best physics-based prediction methods

still generally perform significantly better than the best probabilistic approaches. In

principle, only if the computational effort of one particular variant could be improved

without significant losses in quality (that is if one of them requires considerably less

time than the others while it sacrifices only little predictive accuracy), then the cor-

responding method would be undoubtably the number one choice for practical appli-

cations, indeed outperforming all other modern computational tools for predicting the

secondary structure of RNA sequences. This, by the way, due to the often quite large

sizes of native RNA molecules considered in practice, meets exactly the demands im-

posed by biologists on computational prediction procedures: rather getting moderately

less accurate (but still good quality) results in less time than needing significantly more

time for obtaining results that are expectedly not considerably more accurate.

Note that recently, there already have been several practical heuristic speedups [11,12].

Particularly, the approach of [11] for folding single RNA sequences manages to speed

up the standard dynamic programming algorithms without sacrificing the optimality of

the results, yielding an expected time complexity of

( n 3 ) and

O

( n 2 ·

ψ ( n )) ,where ψ ( n ) is shown

to be constant on average under standard polymer folding models. In [12], it is shown

how to reduce those average-case time and space complexities in the sparse case. Fur-

thermore, the practical technique from [13] achieves an improved worst-case time com-

plexity of

O

( n 3 / log( n )) , and with the (MFE and SCFG based) algorithms from [14],

a slight worst-case speedup of

O

log(log( n )) 1 / 2 / log( n ) 1 / 2 ) time can be reached

(whose practicality is unlikely and unestablished).

In this article, we present a new way to reduce the worst-case time complexity of

SCFG based statistical sampling by a linear factor, making it possible to predict for

instance the most probable (MP) structure among all sampled foldings for a given in-

put sequence of length n (in direct analogy to conventional structure prediction via

( n 3 ·

O

Biomedical Engineering Systems and Technologies

Search WWH ::

Custom Search

Home