Biology Reference
In-Depth Information
The assumption underlying this formula is that all bases occur with an
equal probability of 0.25 in random sequences. This is the simplest
background (null) model that can be used in this context. Markov chains,
which assume unequal probabilities for different bases and dependencies
between consecutive bases, are more realistic background models for
genomic DNA sequences. Algorithms have been presented for computing
$p_i$ for such a model, as well as for consensus sequences including
ambiguous positions represented by IUPAC codes [18] and also for weight
matrices [19].
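To make the contrast between the two background models concrete, the following minimal sketch computes the probability of a single word under the uniform model and under a first-order Markov chain. The function names and all numerical parameters are illustrative assumptions, not values taken from the text or from any published genome model.

```python
def uniform_word_prob(word):
    """Probability of a word when every base has probability 0.25."""
    return 0.25 ** len(word)


def markov_word_prob(word, initial, transition):
    """Probability of a word under a first-order Markov chain.

    initial[b]       : probability that the word starts with base b
    transition[a][b] : probability that base b follows base a
    """
    p = initial[word[0]]
    for prev, cur in zip(word, word[1:]):
        p *= transition[prev][cur]
    return p


# Hypothetical parameters (made up for illustration): an AT-rich
# background in which the dinucleotide CG is depleted.
initial = {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3}
transition = {
    "A": {"A": 0.35, "C": 0.15, "G": 0.20, "T": 0.30},
    "C": {"A": 0.35, "C": 0.25, "G": 0.05, "T": 0.35},
    "G": {"A": 0.30, "C": 0.20, "G": 0.20, "T": 0.30},
    "T": {"A": 0.25, "C": 0.20, "G": 0.20, "T": 0.35},
}

word = "TACGTA"
p_uniform = uniform_word_prob(word)
p_markov = markov_word_prob(word, initial, transition)
```

With these made-up parameters, the word contains a CG dinucleotide and therefore receives a lower probability under the Markov chain than under the uniform model. This is the kind of difference that can change whether a word count looks surprising.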
The Bayesian approach will be illustrated with the mixture model
used by the program MEME. Again, we assume the "arn" search mode.
To circumvent the mathematical difficulties of overlapping-word
statistics, the input sequence set is usually evaluated as if it consisted of
$N$ nonoverlapping $k$-letter subsequences ($N$ is the search space defined
before). In the simplest case, the mixture model consists of two
components: a motif model given by a probability matrix and a background
model given by a base probability distribution. The probability of the
sequences given the model is then computed as
$$\mathrm{Prob}(x, M, M_0, q) = \prod_{j} \left( q\,P(x_j \mid M) + (1 - q)\,P(x_j \mid M_0) \right). \tag{5}$$
In this notation, $x$ denotes the total set of overlapping $k$-letter
subsequences contained in the input sequences, and $x_j$ is an individual member
of it. $P(x_j \mid M)$ and $P(x_j \mid M_0)$ are the probabilities of subsequence $x_j$ given
the motif model and the background model, respectively. $q$ is the mixture
coefficient indicating the sequence-independent probability that a given
subsequence constitutes a motif. The models $M$ and $M_0$ both define
probability distributions over all $k$-letter words. The probabilities of sequence
$x_j$ under the motif and background models, respectively, are defined as
follows:
$$P(x_j \mid M) = \prod_{i=1}^{k} p(i, x_{j,i}) \tag{6}$$
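As a rough illustration of how Equations (5) and (6) fit together, the sketch below evaluates the logarithm of the mixture likelihood for a small set of sequences. It is not MEME's implementation. The function names, the example matrix, and the example sequences are hypothetical. The background probability is assumed to be a product of independent base probabilities, following the description of the background model as a base probability distribution. Here `pwm[i][b]` plays the role of $p(i, b)$, the probability of base $b$ at motif position $i$.

```python
import math


def motif_prob(word, pwm):
    """Equation (6): product over positions i of p(i, x_i)."""
    p = 1.0
    for i, base in enumerate(word):
        p *= pwm[i][base]
    return p


def background_prob(word, base_probs):
    """Assumed background model: independent bases with fixed probabilities."""
    p = 1.0
    for base in word:
        p *= base_probs[base]
    return p


def kmers(sequences, k):
    """All overlapping k-letter subsequences of the input sequences."""
    return [s[i:i + k] for s in sequences for i in range(len(s) - k + 1)]


def mixture_log_likelihood(sequences, pwm, base_probs, q):
    """Logarithm of Equation (5): sum of the log mixture terms over all k-mers."""
    k = len(pwm)
    total = 0.0
    for w in kmers(sequences, k):
        total += math.log(q * motif_prob(w, pwm)
                          + (1.0 - q) * background_prob(w, base_probs))
    return total


# Hypothetical 3-letter motif favouring ATG, with a uniform background.
pwm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {b: 0.25 for b in "ACGT"}
sequences = ["CCATGCC", "TTATGAA"]

log_lik = mixture_log_likelihood(sequences, pwm, background, q=0.1)
```

The sum of logarithms replaces the product in Equation (5) only to avoid numerical underflow when many subsequences are multiplied together. The quantity being computed is otherwise the same.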