sample set as defined in [4], since those effectively reflect the overall behavior of the sample. Those predictions must in any case be considered inappropriate choices here, since their computation requires O(n³) time, which would inevitably undo the time reduction achieved by the approximation (except in the case of γ-centroid structures for the default choice γ = 1, i.e., unique centroids, as those can be derived in O(n²) time). Nevertheless, without significant losses in performance (that is, without increasing the worst-case time complexity of the overall algorithm), we can identify the MP structure(s) of the generated sample⁶, in strong analogy to traditional SCFG approaches. Since for this selection principle we can actually rely on the exact distribution of feasible structures⁷, this seems to be the right choice indeed.
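Such an MP selection from a sample can be sketched as follows (a minimal Python sketch; the dot-bracket strings and the probability table are purely illustrative and not taken from [4]):

```python
def most_probable_structures(sample, probability):
    """Return the structure(s) of maximum exact probability in a sample.

    `sample` is a list of candidate foldings (here: dot-bracket strings);
    `probability` maps a folding to its exact probability under the model.
    Duplicates are collapsed first, since the selection relies on the exact
    distribution of feasible structures, not on sampling frequencies.
    """
    candidates = set(sample)
    best = max(probability(s) for s in candidates)
    return [s for s in candidates if probability(s) == best]

# toy usage with an illustrative probability table
probs = {"((..))": 0.40, "(....)": 0.35, "......": 0.25}
sample = ["(....)", "((..))", "......", "((..))"]
print(most_probable_structures(sample, probs.get))  # ['((..))']
```

Note that ties are returned as a set of co-optimal structures, matching the plural "MP structure(s)" above.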
It should be mentioned that one could, however, alternatively consider a particular
subset of the overall sample that contains only those candidate foldings satisfying a
preliminary defined quality criterion (e.g., only structures with probabilities above a
specified threshold or with not less than a specified minimum number of base pairs).
This means candidate foldings of low quality are disregarded, such that constructing γ-MEA or γ-centroid structures might then result in reliable predictions (where only the derivation of the unique centroid is reasonable with respect to time complexity). Notably, in this context, one should apply a corresponding rejection scheme during sample
composition (i.e., generated structures are only added to the sample set if they meet
the preliminary specified requirements, otherwise they are rejected), since this is obvi-
ously more efficient than collecting all generated structures and afterwards employing
a corresponding filtering process in order to identify the subset to be considered. Utilizing a reasonable rejection criterion, it actually becomes possible to generate any desired number of candidate foldings that obey the imposed restrictions: new structures
are generated until the resulting sample set is large enough. In connection with noisy
ensemble distributions, choosing only moderate sizes for such filtered samples might
suffice under certain circumstances, which might then result in a reduced runtime compared to the generation of large unfiltered sample sets. In this work, however, we will exclusively consider unfiltered samples and MP predictions.
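The rejection scheme outlined above can be sketched as follows (a hypothetical sampler `draw_structure` and quality criterion `accepts` are assumed; both names are illustrative):

```python
import random

def sample_filtered(draw_structure, accepts, target_size, max_attempts=100000):
    """Generate candidate foldings until the filtered sample is large enough.

    `draw_structure()` samples one folding from the ensemble distribution;
    `accepts` is the preliminarily specified quality criterion. Structures
    failing it are rejected immediately instead of being collected and
    filtered afterwards.
    """
    sample = []
    attempts = 0
    while len(sample) < target_size and attempts < max_attempts:
        s = draw_structure()
        attempts += 1
        if accepts(s):
            sample.append(s)
    return sample

# toy example: accept only foldings with at least two base pairs
random.seed(0)
pool = ["((..))", "(....)", "......", "(()())"]
result = sample_filtered(lambda: random.choice(pool),
                         lambda s: s.count("(") >= 2, target_size=5)
print(all(s.count("(") >= 2 for s in result))  # prints True
```

The `max_attempts` cap guards against a criterion so strict that acceptable foldings are (almost) never drawn.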
On the basis of a series of experiments, we observed that stability in resulting pre-
dictions and a competitive prediction accuracy can only be reached by increasing the
sample size, especially in the case of complete approximation for the preprocessing
step and sampling according to the alternative strategy introduced in Sect. 4.2. That
is, more candidate structures ought to be generated for guaranteeing that the resulting
MP predictions are reproducible (by independent runs for the same input sequence)
and of high quality. This negative effect is considerably lowered by using (larger) constant values of W_exact > 0, and is actually less recognizable when employing the
⁶ The probability of each structure can either be determined on the fly while sampling, by multiplying the probabilities of the production rules that correspond to the respective sampling decisions, or otherwise - since the underlying SCFG from [4] is unambiguous - be computed in O(n²) time, making use, e.g., of an Earley-style parser.
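The on-the-fly variant of footnote 6 can be sketched by accumulating rule probabilities along the sampling decisions (the grammar encoding below is a made-up toy, not the SCFG from [4]):

```python
def sample_with_probability(sample_step, rule_prob, start_symbol):
    """Sample a derivation and compute the structure's probability on the fly.

    `sample_step(symbol)` performs one sampling decision: it returns the
    chosen production rule and the nonterminals on its right-hand side.
    The probability is the running product of all applied rule probabilities.
    """
    prob = 1.0
    stack = [start_symbol]
    while stack:
        rule, successors = sample_step(stack.pop())
        prob *= rule_prob[rule]
        stack.extend(successors)
    return prob

# deterministic toy grammar: S -> A B, A -> a, B -> b
rule_prob = {"S->AB": 1.0, "A->a": 0.5, "B->b": 0.4}
steps = {"S": ("S->AB", ["A", "B"]), "A": ("A->a", []), "B": ("B->b", [])}
print(sample_with_probability(steps.get, rule_prob, "S"))  # 0.2
```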
⁷ Note that the probability of a particular folding of a given RNA sequence is equal to a product of (different powers of the various) transition and emission probabilities (according to the corresponding derivation tree), which means it depends only on the exact trained parameter values of the underlying SCFG.
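In line with footnote 7, the probability of a folding is the product, over all distinct rules, of each rule's probability raised to the number of times the rule occurs in the derivation tree (a minimal sketch with illustrative parameter values):

```python
from collections import Counter

def folding_probability(derivation_rules, rule_prob):
    """Product over distinct rules r of rule_prob[r] ** (occurrences of r);
    equal to multiplying rule probabilities along the whole derivation."""
    prob = 1.0
    for rule, count in Counter(derivation_rules).items():
        prob *= rule_prob[rule] ** count
    return prob

# A -> a used twice, S -> A A used once (illustrative parameters)
print(folding_probability(["S->AA", "A->a", "A->a"],
                          {"S->AA": 1.0, "A->a": 0.5}))  # 0.25
```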
 