Probability (a) is minimized and, from a statistical viewpoint, reflects the classical frequentist approach exemplified, for instance, by the objective function introduced by van Helden et al.15 Probability (b) is maximized and is inspired by Bayesian statistics; in this case, the motif is part of a probabilistic generative model such as a hidden Markov model (HMM)8 or a mixture model.16 The two types of probabilities will be illustrated by examples below.
Let us first turn to the question of how to compute the probability
that a motif occurs at least n times in an input sequence set. Here, and
in all following examples, we will assume that motif frequencies were
determined in the “anr” search mode. An exact solution to this problem is hard to obtain because of the statistical nonindependence of overlapping words.17 In fact, the probability distribution of a k-letter word occurring zero to N times in a sequence of length N + k - 1 depends on the internal repeat structure of the word. To bypass this
difficulty, motif discovery algorithms often rely on approximations,
which are debatable from a mathematical viewpoint. According to the
frequently used Poisson approximation, which assumes independence
between motif occurrences, the probability of finding a motif exactly
n times is given by
\mathrm{Prob}(n, E_i) = \frac{E_i^{\,n}}{n!}\,\exp(-E_i) .        (3)
Here, E_i is the expected number of occurrences of a given motif i, which is the product of the search space N and the probability p_i that a random sequence of length k constitutes an instance of motif i. The search space is the number of all possible starting positions for a motif of length k in the input DNA sequence set.
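As a concrete illustration, the sketch below evaluates Eq. (3) and the corresponding P-value for observing a motif at least n times; the values chosen for N, p_i, and the observed count are hypothetical and only demonstrate the calculation, they do not come from the text.

```python
import math

def poisson_prob_exact(n, expected):
    """Poisson probability of exactly n motif occurrences, given the
    expected count E_i (Eq. 3)."""
    return expected ** n * math.exp(-expected) / math.factorial(n)

def poisson_pvalue_at_least(n, expected):
    """P-value for at least n occurrences under the Poisson
    approximation: one minus the probability of 0..n-1 occurrences."""
    return 1.0 - sum(poisson_prob_exact(i, expected) for i in range(n))

# Hypothetical example: N starting positions, per-position match
# probability p_i, and an observed count of 5.
N = 10_000    # search space (number of possible starting positions)
p_i = 2.5e-4  # probability that a random k-mer is an instance of motif i
E_i = N * p_i # expected number of occurrences (2.5 here)

print(poisson_prob_exact(5, E_i))       # probability of exactly 5 occurrences
print(poisson_pvalue_at_least(5, E_i))  # probability of 5 or more occurrences
```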
If the motif description consists of a consensus sequence based on
the four-letter DNA alphabet, with a maximal number of m mismatches
allowed, then the probability p_i may be computed as follows:
p_i = \sum_{j=0}^{m} \binom{k}{j}\, 0.25^{\,k-j}\, 0.75^{\,j} .        (4)
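Equation (4) is straightforward to evaluate directly; the short sketch below does so for a hypothetical consensus length and mismatch limit, assuming equiprobable bases (a 0.25 match probability per letter) as in the formula.

```python
from math import comb

def consensus_match_prob(k, m):
    """Probability p_i that a random DNA k-mer matches a fixed consensus
    of length k with at most m mismatches (Eq. 4), assuming each of the
    four bases occurs with probability 0.25."""
    return sum(comb(k, j) * 0.25 ** (k - j) * 0.75 ** j for j in range(m + 1))

# Hypothetical example: an 8-letter consensus with at most 1 mismatch allowed.
print(consensus_match_prob(8, 1))  # roughly 3.8e-4
```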