Fast RNA Secondary Structure Prediction Using Fuzzy Stochastic Models - Biomedical Engineering Systems and Technologies

Biomedical Engineering Reference

In-Depth Information

Preliminaries

In the sequel, given an RNA molecule r consisting of n nucleotides, we denote the

corresponding sequence fragment from position i to position j , 1

n ,by

R i,j = r i r i +1 ...r j− 1 r j . Accordingly, S i,j denotes a feasible secondary structure on

R i,j .

≤

2.1

Sampling Based on SCFG Models

Briefly, probabilistic sampling based on a suitable SCFG

G s with sets

I G s

and

R G s

of intermediate symbols and productions, respectively, and axiom S

∈I G s (that mod-

els the class of all feasible secondary structures) has two basic steps: In the first step

(preprocessing), all inside probabilities

⇒ lm r i ...r j )

α X ( i,j ):=Pr( X

(1)

and all outside probabilities

⇒ lm r 1 ...r i− 1 Xr j +1 ...r n )

β X ( i,j ):=Pr( S

(2)

for a sequence r of size n , X

n , are computed. According to [4],

this can be done with a special variant of an Earley-style parser (such that the consid-

ered grammar does not need to be in Chomsky normal form (CNF) ), where the grammar

parameters (trained beforehand on a suitable RNA structure database) are splitted into

asetof transition probabilities Pr tr (

∈I G s

and 1

≤

i,j

≤

rule

) for

rule ∈R G s

and two sets of emission

probabilities Pr em (

) for the 4 unpaired bases and the 16 possible base pairings. For

( n 2 ) memory require-

ment for this preprocessing step. Note that in this work, we will use the sophisticated

grammar from [4] which has been parameterized to impose two relevant restrictions

on the class of all feasible structures, namely a minimum length of min HL for hairpin

loops and a minimum number of min hel consecutive base pairs for helices.

The second step takes the form of a recursive sampling algorithm to randomly draw a

complete secondary structure by consecutively sampling substructures (defined by base

pairs and unpaired bases). Notably, different sampling strategies may be employed for

realizing this step; two contrary variants that will be considered within this work are

described in detail in Sect. 4. In general, for any sampling decision (for example choice

of a new base pair), the strategy considers the respective set of all possible choices that

might actually be formed on the currently considered fragment of the input sequence.

Any of these sets contains exactly the mutually and exclusive cases as defined by the al-

ternative productions (of a particular intermediate symbol) of the underlying grammar.

The corresponding random choice is then drawn according to the resulting conditional

sampling distribution (for the considered sequence fragment). This means that the re-

spective sampling distributions are defined by the inside and outside values derived in

step one (providing information on the distribution of all possible choices according to

the actual input sequence) and the grammar parameters (transition probabilities).

G s , there results

( n 3 ) time complexity and

any such SCFG

Biomedical Engineering Systems and Technologies

Search WWH ::

Custom Search

Home