sample set as defined in [4], since those effectively reflect the overall behavior of the sample. Those predictions must in any case be considered inappropriate choices here, since their computation requires O(n³) time, which would inevitably undo the time reduction achieved by the approximation (except in the case of γ-centroid structures for the default choice γ = 1, i.e., unique centroids, as those can be derived in O(n²) time). Nevertheless, without significant losses in performance (that is, without increasing the worst-case time complexity of the overall algorithm), we can identify the MP structure(s) of the generated sample⁶, in strong analogy to traditional SCFG approaches. Since for this selection principle we can actually rely on the exact distribution of feasible structures⁷, this seems to be the right choice indeed.
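Such an MP selection from a sample can be sketched as follows (a minimal Python sketch; the dot-bracket strings and the probability table are purely illustrative and not taken from [4]):

```python
def most_probable_structures(sample, probability):
    """Return the structure(s) of maximum exact probability in a sample.

    `sample` is a list of candidate foldings (here: dot-bracket strings);
    `probability` maps a folding to its exact probability under the model.
    Duplicates are collapsed first, since the selection relies on the exact
    distribution of feasible structures, not on sampling frequencies.
    """
    candidates = set(sample)
    best = max(probability(s) for s in candidates)
    return [s for s in candidates if probability(s) == best]

# toy usage with an illustrative probability table
probs = {"((..))": 0.40, "(....)": 0.35, "......": 0.25}
sample = ["(....)", "((..))", "......", "((..))"]
print(most_probable_structures(sample, probs.get))  # ['((..))']
```

Note that ties are returned as a set of co-optimal structures, matching the plural "MP structure(s)" above.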
It should be mentioned that one could, however, alternatively consider a particular
subset of the overall sample that contains only those candidate foldings satisfying a
preliminary defined quality criterion (e.g., only structures with probabilities above a
specified threshold or with not less than a specified minimum number of base pairs).
This means candidate foldings of low quality are disregarded, such that constructing γ-MEA or γ-centroid structures might then result in reliable predictions (where only the derivation of the unique centroid is reasonable with respect to time complexity). Notably, in this context, one should apply a corresponding rejection scheme during sample
composition (i.e., generated structures are only added to the sample set if they meet
the preliminary specified requirements, otherwise they are rejected), since this is obvi-
ously more efficient than collecting all generated structures and afterwards employing
a corresponding filtering process in order to identify the subset to be considered. Utilizing a reasonable rejection criterion, it actually becomes possible to generate any desired number of candidate foldings that obey the imposed restrictions: new structures
are generated until the resulting sample set is large enough. In connection with noisy
ensemble distributions, choosing only moderate sizes for such filtered samples might
suffice under certain circumstances, which might then result in a reduced runtime compared to the generation of large unfiltered sample sets. In this work, however, we will exclusively consider unfiltered samples and MP predictions.
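The rejection scheme outlined above can be sketched as follows (a hypothetical sampler `draw_structure` and quality criterion `accepts` are assumed; both names are illustrative):

```python
import random

def sample_filtered(draw_structure, accepts, target_size, max_attempts=100000):
    """Generate candidate foldings until the filtered sample is large enough.

    `draw_structure()` samples one folding from the ensemble distribution;
    `accepts` is the preliminarily specified quality criterion. Structures
    failing it are rejected immediately instead of being collected and
    filtered afterwards.
    """
    sample = []
    attempts = 0
    while len(sample) < target_size and attempts < max_attempts:
        s = draw_structure()
        attempts += 1
        if accepts(s):
            sample.append(s)
    return sample

# toy example: accept only foldings with at least two base pairs
random.seed(0)
pool = ["((..))", "(....)", "......", "(()())"]
result = sample_filtered(lambda: random.choice(pool),
                         lambda s: s.count("(") >= 2, target_size=5)
print(all(s.count("(") >= 2 for s in result))  # prints True
```

The `max_attempts` cap guards against a criterion so strict that acceptable foldings are (almost) never drawn.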
On the basis of a series of experiments, we observed that stability in resulting pre-
dictions and a competitive prediction accuracy can only be reached by increasing the
sample size, especially in the case of complete approximation for the preprocessing
step and sampling according to the alternative strategy introduced in Sect. 4.2. That
is, more candidate structures ought to be generated for guaranteeing that the resulting
MP predictions are reproducible (by independent runs for the same input sequence)
and of high quality. This negative effect is considerably lowered by using (larger) constant values of W_exact > 0, and is actually less recognizable when employing the
⁶ The probability of each structure can either be determined on the fly while sampling, by multiplying the probabilities of the production rules that correspond to the respective sampling decisions, or otherwise - since the underlying SCFG from [4] is unambiguous - be computed in O(n²) time, making use, e.g., of an Earley-style parser.
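The on-the-fly variant of footnote 6 can be sketched by accumulating rule probabilities along the sampling decisions (the grammar encoding below is a made-up toy, not the SCFG from [4]):

```python
def sample_with_probability(sample_step, rule_prob, start_symbol):
    """Sample a derivation and compute the structure's probability on the fly.

    `sample_step(symbol)` performs one sampling decision: it returns the
    chosen production rule and the nonterminals on its right-hand side.
    The probability is the running product of all applied rule probabilities.
    """
    prob = 1.0
    stack = [start_symbol]
    while stack:
        rule, successors = sample_step(stack.pop())
        prob *= rule_prob[rule]
        stack.extend(successors)
    return prob

# deterministic toy grammar: S -> A B, A -> a, B -> b
rule_prob = {"S->AB": 1.0, "A->a": 0.5, "B->b": 0.4}
steps = {"S": ("S->AB", ["A", "B"]), "A": ("A->a", []), "B": ("B->b", [])}
print(sample_with_probability(steps.get, rule_prob, "S"))  # 0.2
```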
⁷ Note that the probability of a particular folding of a given RNA sequence is equal to a product of (different powers of the various) transition and emission probabilities (according to the corresponding derivation tree), which means it depends only on the exact trained parameter values of the underlying SCFG.
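In line with footnote 7, the probability of a folding is the product, over all distinct rules, of each rule's probability raised to the number of times the rule occurs in the derivation tree (a minimal sketch with illustrative parameter values):

```python
from collections import Counter

def folding_probability(derivation_rules, rule_prob):
    """Product over distinct rules r of rule_prob[r] ** (occurrences of r);
    equal to multiplying rule probabilities along the whole derivation."""
    prob = 1.0
    for rule, count in Counter(derivation_rules).items():
        prob *= rule_prob[rule] ** count
    return prob

# A -> a used twice, S -> A A used once (illustrative parameters)
print(folding_probability(["S->AA", "A->a", "A->a"],
                          {"S->AA": 1.0, "A->a": 0.5}))  # 0.25
```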
 