Biomedical Engineering Reference
In-Depth Information
quality of generated samples decreases (as indicated by probability profiling for specific
loop types), which is due to the approximated ensemble distribution. As a consequence,
we usually need to use larger sample sizes for obtaining a competitive prediction accu-
racy and stable predictions, i.e., more candidate structures for a given input sequence
have to be generated to ensure that the approximation method outputs rather identical
predictions in independent runs for that sequence. According to our experiments, an
efficient implementation that really takes advantage of the accelerated preprocessing (3.7
compared to 49 seconds for our proof-of-concept implementation in Wolfram
Mathematica) yet handles large sample sizes can be obtained by parallelization.
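The parallelization mentioned above can be sketched as follows. This is only a minimal illustration, not the implementation used for the reported experiments: the sampler is replaced by a hypothetical toy distribution over three dot-bracket structures (`TOY_DISTRIBUTION`), and the names `sample_batch`, `parallel_sample` and `most_frequent` are assumptions made for this example. It also assumes that the prediction is taken to be the most frequent structure in the sample. The point is that, once preprocessing is done, candidate structures are drawn independently, so a large sample can be split into batches with separate seeds and the counts merged afterwards.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for the real sampler: a fixed toy distribution
# over three dot-bracket structures (not taken from the article).
TOY_DISTRIBUTION = {"((...))": 0.6, "(.....)": 0.3, ".......": 0.1}


def sample_batch(seed, batch_size):
    """Draw batch_size structures with a private, seeded generator."""
    rng = random.Random(seed)
    structures = list(TOY_DISTRIBUTION)
    weights = list(TOY_DISTRIBUTION.values())
    return Counter(rng.choices(structures, weights=weights, k=batch_size))


def parallel_sample(sample_size, workers=4, base_seed=0,
                    executor_cls=ThreadPoolExecutor):
    """Split the sample into one batch per worker and merge the counts."""
    base, extra = divmod(sample_size, workers)
    sizes = [base + (1 if w < extra else 0) for w in range(workers)]
    seeds = [base_seed + w for w in range(workers)]
    total = Counter()
    with executor_cls(max_workers=workers) as pool:
        for counts in pool.map(sample_batch, seeds, sizes):
            total.update(counts)
    return total


def most_frequent(counts):
    """Return the structure that occurs most often in the merged sample."""
    return counts.most_common(1)[0][0]
```

Threads are used here only to keep the sketch easy to run. For CPU-bound sampling in Python one would more likely pass `ProcessPoolExecutor` as `executor_cls`, with the call placed under an `if __name__ == "__main__":` guard.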
Note that all results presented in this article were obtained with a dedicated
proof-of-concept implementation of the described methods. A more sophisticated tool
will be realized in the future, in the hope that the proposed prediction approach proves
capable of yielding acceptable accuracies even for types of RNAs whose molecules
exhibit a great variety of structural features (due to large sequence lengths). In fact, we
here only considered exemplary applications for simple tRNA sequences (specifically,
for one particular tRNA molecule and a collection of 100 distinct tRNAs, respectively)
in order to get positive feedback that (at least) the MP predictions obtained via
approximated SCFG-based sampling can be of high quality. Accordingly, more general
experiments are needed, e.g., in connection with RNA molecules of sizes n = 3000 to 30000
(for which the memory constraints of our approach are not restrictive assuming 1GB
of memory for each core) and where long distance base pairs in a global folding are
of interest. In such a scenario, the proposed algorithm could be the method of choice -
provided it performs similarly well.
This line of research is work in progress, but we found the first impressions presented
within this note so encouraging that we wanted to share them with the scientific
community already at this point, primarily because this work leaves a number of open questions
that may inspire further research by other groups. For instance, recall that we
used a sophisticated SCFG (representing a formal language counterpart to the thermo-
dynamic model applied in the Sfold program) as probabilistic basis for the considered
sampling strategies. However, it would also be possible to employ other SCFG designs,
for example one of the commonly known lightweight grammars from [17]. This might
of course yield at least noticeable if not significant changes in the resulting sampling
quality, which could be an interesting subject to be explored.
It should also be noted that a similar approximative approach could potentially be
considered when attempting to reduce the worst-case time complexity of the sampling
extension of the PF approach. In fact, since sequence information is incorporated into
the used (equilibrium) PFs and corresponding sampling probabilities only in the form of
particular sequence-dependent free energy contributions, it seems reasonable to believe
that the time complexity for the forward step (preprocessing) could possibly be reduced
by a linear factor to O(n^2) when using some sort of approximated (averaged) free
energy contributions that do not depend on the actual sequence (but contain as much
sequence information as possible), in analogy to the approximated preprocessing step
(inside and outside calculations) considered in this work, where we only had to use
averaged emission terms instead of the exact emission probabilities in order to save
time.
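The effect of replacing exact emission probabilities by averaged terms can be illustrated with a small sketch. The code below is not the sophisticated SCFG used in this work. It assumes a hypothetical toy grammar S -> u S | ( S ) S | (empty word), with made-up rule probabilities, and it takes the plain arithmetic mean of the emission probabilities as the "averaged" term, which is an assumption of this example. In this toy setting the exact inside values depend on the positions i and j and require cubic time, whereas with sequence-independent emissions they depend on the subword length only, so a single table indexed by length suffices and the computation becomes quadratic.

```python
from itertools import product

BASES = "ACGU"
# Hypothetical rule probabilities for S -> u S | ( S ) S | (empty word).
P_UNPAIRED, P_PAIR, P_END = 0.5, 0.3, 0.2


def inside_exact(seq, e_unpaired, e_pair):
    """Exact inside values table[i][j] for seq[i:j]; cubic time in len(seq)."""
    n = len(seq)
    table = [[0.0] * (n + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        table[i][i] = P_END  # the empty subword is derived by S -> (empty word)
    for length in range(1, n + 1):
        for i in range(n - length + 1):
            j = i + length
            # seq[i] is unpaired, the rest is derived from S again
            value = P_UNPAIRED * e_unpaired[seq[i]] * table[i + 1][j]
            # seq[i] pairs with seq[k]; S derives the inside and the remainder
            for k in range(i + 1, j):
                value += (P_PAIR * e_pair[seq[i], seq[k]]
                          * table[i + 1][k] * table[k + 1][j])
            table[i][j] = value
    return table


def inside_averaged(n, e_unpaired, e_pair):
    """Length-indexed inside values with averaged emissions; quadratic time."""
    mean_u = sum(e_unpaired.values()) / len(e_unpaired)
    mean_p = sum(e_pair.values()) / len(e_pair)
    by_length = [P_END] + [0.0] * n
    for length in range(1, n + 1):
        value = P_UNPAIRED * mean_u * by_length[length - 1]
        # inner = length of the part enclosed by the base pair
        for inner in range(length - 1):
            value += (P_PAIR * mean_p
                      * by_length[inner] * by_length[length - 2 - inner])
        by_length[length] = value
    return by_length


# Example emission tables (uniform, purely for illustration).
uniform_unpaired = {b: 0.25 for b in BASES}
uniform_pair = {pair: 1.0 / 16 for pair in product(BASES, repeat=2)}
```

When the emission tables are uniform, as in the example tables above, both functions yield the same values for every prefix length, which serves as a simple consistency check. With non-uniform emissions the averaged table is only an approximation of the exact one.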