Fast RNA Secondary Structure Prediction Using Fuzzy Stochastic Models - Biomedical Engineering Systems and Technologies


quality of generated samples decreases (as indicated by probability profiling for specific loop types), which is due to the approximated ensemble distribution. As a consequence, we usually need larger sample sizes to obtain competitive prediction accuracy and stable predictions, i.e., more candidate structures have to be generated for a given input sequence to ensure that the approximation method outputs nearly identical predictions in independent runs for that sequence. According to our experiments, an efficient implementation that really takes advantage of the accelerated preprocessing (3.7 compared to 49 seconds for our proof-of-concept implementation in Wolfram Mathematica) but still handles large sample sizes can be obtained by parallelization.
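The interplay of sample size, run-to-run stability and parallel sampling can be illustrated with a minimal Python sketch. It is not the implementation used for the results reported here. The sampler is a hypothetical stand-in that draws dot-bracket strings from a fixed toy distribution (`CANDIDATES`, `WEIGHTS`), and taking the most frequent sampled structure as the prediction is an assumed selection rule chosen for illustration only. Threads are used to keep the example self-contained; for CPU-bound sampling in Python a process pool would be the more natural choice.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a structure sampler: a fixed distribution over
# three candidate structures. A real sampler would draw from the
# (approximated) ensemble distribution of the input sequence.
CANDIDATES = ["((..))", "(....)", "......"]
WEIGHTS = [0.5, 0.3, 0.2]


def sample_batch(job):
    """Draw one batch of structures and return their frequencies."""
    seed, size = job
    rng = random.Random(seed)  # independent generator per batch
    return Counter(rng.choices(CANDIDATES, weights=WEIGHTS, k=size))


def predict(sample_size, seed, workers=4):
    """Split the sample over several workers and merge the counts."""
    per_worker = sample_size // workers
    jobs = [(seed * 1000 + w, per_worker) for w in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = sum(pool.map(sample_batch, jobs), Counter())
    # Assumed selection rule: the most frequent sampled structure.
    return counts.most_common(1)[0][0]


def agreement(sample_size, runs=20):
    """Fraction of independent runs that return the modal prediction."""
    predictions = [predict(sample_size, seed) for seed in range(runs)]
    return Counter(predictions).most_common(1)[0][1] / runs


if __name__ == "__main__":
    for size in (8, 40, 4000):
        print(size, agreement(size))
```

Because every batch only contributes a frequency table, the batches are independent of each other and can be distributed over cores without any communication until the final merge.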

Note that all results presented in this article have been derived with a purpose-built proof-of-concept implementation of the described methods. A more sophisticated tool will be realized in the future, in the hope that the proposed prediction approach proves capable of yielding acceptable accuracies even for types of RNAs whose molecules exhibit a great variety of structural features (due to large sequence lengths). In fact, we considered here only exemplary applications to simple tRNA sequences (specifically, one particular tRNA molecule and a collection of 100 distinct tRNAs, respectively) in order to obtain a first indication that (at least) the MP predictions obtained via approximated SCFG-based sampling can be of high quality. Accordingly, more general experiments are needed, e.g., with RNA molecules of sizes n = 3000–30000 (for which the memory constraints of our approach are not restrictive, assuming 1GB of memory for each core) and where long-distance base pairs in a global folding are of interest. In such a scenario, the proposed algorithm could be the method of choice, provided it performs similarly well.

This line of research is work in progress, but we found the first impressions presented in this note so motivating that we wanted to share them with the scientific community already at this point, primarily because this work leaves a number of open questions that may inspire further research by other groups. For instance, recall that we used a sophisticated SCFG (representing a formal language counterpart to the thermodynamic model applied in the Sfold program) as the probabilistic basis for the considered sampling strategies. However, it would also be possible to employ other SCFG designs, for example one of the commonly known lightweight grammars from [17]. This might of course yield at least noticeable, if not significant, changes in the resulting sampling quality, which could be an interesting subject to explore.

It should also be noted that a similar approximative approach could potentially be considered when attempting to reduce the worst-case time complexity of the sampling extension of the PF approach. In fact, since sequence information is incorporated into the (equilibrium) PFs and the corresponding sampling probabilities only in the form of particular sequence-dependent free energy contributions, it seems reasonable to believe that the time complexity of the forward step (preprocessing) could be reduced by a linear factor to O(n^2) by using some sort of approximated (averaged) free energy contributions that do not depend on the actual sequence (but contain as much sequence information as possible). This would be analogous to the approximated preprocessing step (inside and outside calculations) considered in this work, where we only had to use averaged emission terms instead of the exact emission probabilities in order to save time.
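To make the saving of a linear factor concrete, the following Python sketch contrasts an exact inside computation with one based on averaged emission terms. It rests on assumptions that are not taken from this work: the toy grammar S -> eps | u S | ( S ) S, its rule probabilities, and all function and variable names are hypothetical and serve only as illustration. With exact emissions the inside value depends on the concrete subword, so a table over all pairs (i, j) is needed. With averaged emissions the value depends only on the subword length, so a table over lengths suffices.

```python
# Toy grammar (an assumption for illustration, not the SCFG used in the text):
#   S -> eps | u S | ( S ) S
# where u emits one unpaired base and ( ... ) emits a base pair.
P_EPS, P_UNP, P_PAIR = 0.2, 0.5, 0.3  # hypothetical rule probabilities


def inside_exact(seq, e_unp, e_pair):
    """Exact inside values I[i][j] for the subword seq[i:j]; O(n^3) time."""
    n = len(seq)
    I = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        I[i][i] = P_EPS  # the empty subword is derived via S -> eps
    for d in range(1, n + 1):  # subword length
        for i in range(n - d + 1):
            j = i + d
            total = P_UNP * e_unp[seq[i]] * I[i + 1][j]
            for k in range(i + 1, j):  # position k pairs with position i
                total += (P_PAIR * e_pair[seq[i] + seq[k]]
                          * I[i + 1][k] * I[k + 1][j])
            I[i][j] = total
    return I


def inside_averaged(n, avg_unp, avg_pair):
    """Length-indexed inside values L[d] with averaged emissions; O(n^2) time."""
    L = [0.0] * (n + 1)
    L[0] = P_EPS
    for d in range(1, n + 1):
        total = P_UNP * avg_unp * L[d - 1]
        for m in range(d - 1):  # m = length of the part enclosed by the pair
            total += P_PAIR * avg_pair * L[m] * L[d - 2 - m]
        L[d] = total
    return L


# Sanity check: with uniform emissions, averaging changes nothing, so both
# tables must agree. For non-uniform emissions L is only an approximation.
BASES = "ACGU"
uniform_unp = {b: 0.25 for b in BASES}
uniform_pair = {a + b: 1.0 / 16 for a in BASES for b in BASES}
seq = "GGGAAAUCCC"
I = inside_exact(seq, uniform_unp, uniform_pair)
L = inside_averaged(len(seq), 0.25, 1.0 / 16)
print(abs(I[0][len(seq)] - L[len(seq)]) < 1e-12)
```

The exact table has on the order of n^2 entries, each requiring a sum over up to n split points, whereas the averaged table has only n + 1 entries with the same per-entry cost.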
