Biomedical Engineering Reference
In-Depth Information
( n 2 ) time and space requirements. This complexity improvement
is basically realized by employing an appropriate heuristic instead of the corresponding
exact algorithm for preprocessing the input sequence, i.e. for deriving a “noisy” dis-
tribution (induced by heuristic approximations of the corresponding inside and outside
probabilities) on the entire structure ensemble for the input sequence. From this distri-
bution candidate structures can be efficiently sampled 1 . Moreover, we will consider two
different sampling strategies: (a slight modification of) the widely known sampling pro-
cedure from [1,4] which basically generates a random structure from outside to inside,
and a novel alternative strategy that obeys to contrary principles and employs a reverse
course of action (from inside to outside) and manages to take more advantage of the
approximative preprocessing.
As we will see, even building on our new heuristic preprocessing step, both
sampling strategies can be applied to obtain MP structure predictions of respectable
accuracy. In principle, for sufficiently large sample sizes we obtain a similar high pre-
dictive accuracy as in the case of exact calculations 2 . The seemingly sole pitfall is
that due to the noisy ensemble distribution resulting from approximative computa-
tions, the produced samples are no longer guaranteed to primarily contain rather likely
structures (with respect to the exact distribution of feasible foldings for a given input
sequence), such that we usually have to generate more candidate structures (i.e., con-
sider larger sample sizes) in oder to ensure reproducible structure predictions. However,
this is quite unproblematic in practice: firstly, we can generate the candidate structures
in-place (only the so far most probable structure(s) need(s) to be stored), such that large
sample sizes give no rise to increased memory consumption and secondly, generating
samples can easily be parallelized on modern multi-core architectures or grids.
The rest of this paper is organized as follows: Sect. 2 briefly recaps the principles of
probabilistic statistical sampling and provides the needed formal framework. Sect. 2.3
contains a (short and exemplary) analysis on how different types and levels of distur-
bances in the underlying ensemble distribution affect the resulting sampling quality.
This actually yields an impression on the required precision of an adequate heuristic
approximation scheme for inside-outside calculations. Section 3 describes all facts con-
cerning the approximative preprocessing step that needs to be applied for decreasing
the worst-case time requirements. A (slightly modified) common sampling strategy and
an alternative novel strategy (intended to match well with our heuristic method) are
introduced and opposed in Sect. 4. In Sect. 5, the overall quality of generated sample
sets and their applicability to RNA structure prediction are investigated. We present ex-
periments which show how the prediction accuracy grows with the sample size together
with considerations on how an efficient implementation can deal with large sample sets.
Finally, Sect. 6 concludes the paper.
O
SCFGs) with only
1
With purposive proof-of-concept implementations (in Wolfram Mathematica 7.0), for instance
the overall preprocessing time for E.coli tRNA
Ala (of length n =76 ) could be reduced from
49 . 0 (traditional cubic algorithm) to only 3 . 7 (new quadratic strategy) seconds.
2
Ala , we for instance observed the same sensitivity and PPV values of 1 . 0 and
0 . 91 , respectively, with a particular application of our heuristic method and the corresponding
exact variant.
For E.coli tRNA
 
Search WWH ::




Custom Search