Biomedical Engineering Reference
In-Depth Information
structures for the given sequence. For example, the popular Sfold software [1,2] em-
ploys a sampling extension of the partition function (PF) approach [3] to produce sta-
tistically representative subsets of the Boltzmann-weighted ensemble. More recently,
a corresponding probabilistic method has been studied [4] which actually samples the
possible foldings from a distribution implied by a sophisticated stochastic context-free
grammar (SCFG).
Notably, both sampling methods imply the same worst-case time and space require-
ments of
( n 2 ) , respectively, for generating a fix-size sample of random
secondary structures for a given input sequence of length n and can easily be extended
to structure prediction. In fact, a corresponding prediction can be derived from any rep-
resentative statistical sample without increasing the overall time or space complexity.
Thus, the worst-case time and storage requirements for computing structure predictions
via statistical sampling are equal to those of modern state-of-the-art tools like for in-
stance the commonly used minimum free energy (MFE) based Mfold [5,6] and Vienna
RNA [7,8] packages or the popular SCFG based Pfold software [9,10].
Furthermore, applications to structure prediction showed that neither of the two com-
peting sampling approaches (SCFG and PF based method) generally outperforms the
other and consequently, it is not obvious which one should rather be preferred in prac-
tice. This somehow contradicts the fact that the best physics-based prediction methods
still generally perform significantly better than the best probabilistic approaches. In
principle, only if the computational effort of one particular variant could be improved
without significant losses in quality (that is if one of them requires considerably less
time than the others while it sacrifices only little predictive accuracy), then the cor-
responding method would be undoubtably the number one choice for practical appli-
cations, indeed outperforming all other modern computational tools for predicting the
secondary structure of RNA sequences. This, by the way, due to the often quite large
sizes of native RNA molecules considered in practice, meets exactly the demands im-
posed by biologists on computational prediction procedures: rather getting moderately
less accurate (but still good quality) results in less time than needing significantly more
time for obtaining results that are expectedly not considerably more accurate.
Note that recently, there already have been several practical heuristic speedups [11,12].
Particularly, the approach of [11] for folding single RNA sequences manages to speed
up the standard dynamic programming algorithms without sacrificing the optimality of
the results, yielding an expected time complexity of
( n 3 ) and
O
O
( n 2 ·
ψ ( n )) ,where ψ ( n ) is shown
to be constant on average under standard polymer folding models. In [12], it is shown
how to reduce those average-case time and space complexities in the sparse case. Fur-
thermore, the practical technique from [13] achieves an improved worst-case time com-
plexity of
O
( n 3 / log( n )) , and with the (MFE and SCFG based) algorithms from [14],
a slight worst-case speedup of
O
log(log( n )) 1 / 2 / log( n ) 1 / 2 ) time can be reached
(whose practicality is unlikely and unestablished).
In this article, we present a new way to reduce the worst-case time complexity of
SCFG based statistical sampling by a linear factor, making it possible to predict for
instance the most probable (MP) structure among all sampled foldings for a given in-
put sequence of length n (in direct analogy to conventional structure prediction via
( n 3 ·
O
 
Search WWH ::




Custom Search