Biomedical Engineering Reference
In-Depth Information
2
Preliminaries
In the sequel, given an RNA molecule r consisting of n nucleotides, we denote the
corresponding sequence fragment from position i to position j , 1
n ,by
R i,j = r i r i +1 ...r j− 1 r j . Accordingly, S i,j denotes a feasible secondary structure on
R i,j .
i
j
2.1
Sampling Based on SCFG Models
Briefly, probabilistic sampling based on a suitable SCFG
G s with sets
I G s
and
R G s
of intermediate symbols and productions, respectively, and axiom S
∈I G s (that mod-
els the class of all feasible secondary structures) has two basic steps: In the first step
(preprocessing), all inside probabilities
lm r i ...r j )
α X ( i,j ):=Pr( X
(1)
and all outside probabilities
lm r 1 ...r i− 1 Xr j +1 ...r n )
β X ( i,j ):=Pr( S
(2)
for a sequence r of size n , X
n , are computed. According to [4],
this can be done with a special variant of an Earley-style parser (such that the consid-
ered grammar does not need to be in Chomsky normal form (CNF) ), where the grammar
parameters (trained beforehand on a suitable RNA structure database) are splitted into
asetof transition probabilities Pr tr (
∈I G s
and 1
i,j
rule
) for
rule ∈R G s
and two sets of emission
probabilities Pr em (
·
) for the 4 unpaired bases and the 16 possible base pairings. For
( n 2 ) memory require-
ment for this preprocessing step. Note that in this work, we will use the sophisticated
grammar from [4] which has been parameterized to impose two relevant restrictions
on the class of all feasible structures, namely a minimum length of min HL for hairpin
loops and a minimum number of min hel consecutive base pairs for helices.
The second step takes the form of a recursive sampling algorithm to randomly draw a
complete secondary structure by consecutively sampling substructures (defined by base
pairs and unpaired bases). Notably, different sampling strategies may be employed for
realizing this step; two contrary variants that will be considered within this work are
described in detail in Sect. 4. In general, for any sampling decision (for example choice
of a new base pair), the strategy considers the respective set of all possible choices that
might actually be formed on the currently considered fragment of the input sequence.
Any of these sets contains exactly the mutually and exclusive cases as defined by the al-
ternative productions (of a particular intermediate symbol) of the underlying grammar.
The corresponding random choice is then drawn according to the resulting conditional
sampling distribution (for the considered sequence fragment). This means that the re-
spective sampling distributions are defined by the inside and outside values derived in
step one (providing information on the distribution of all possible choices according to
the actual input sequence) and the grammar parameters (transition probabilities).
G s , there results
O
( n 3 ) time complexity and
O
any such SCFG
 
Search WWH ::




Custom Search