\[ P(x^{j} \mid M_0) = \prod_{i=1}^{k} p_0(x_i^{j}) \qquad (7) \]
Note that the mixture coefficient q is part of the model and is thus itself
a target of optimization by the motif discovery algorithm. It plays a role
similar to that of the threshold on the number of allowed mismatches to a
consensus sequence in the frequentist motif evaluation framework. This
raises the interesting question of whether the two approaches are
equivalent with regard to defining the optimal threshold value for a
consensus sequence or weight matrix. The answer is, to my knowledge, not
known.
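To make the role of q concrete, the following is a minimal sketch, assuming
the usual two-component mixture form, in which each k-letter subsequence is
drawn from the motif model with probability q and from the background model
with probability 1 - q. That form, the function names, the toy matrix, and
the grid search over q are illustrative assumptions and are not taken from
the text; a real motif discovery algorithm would optimize q jointly with
the motif matrix.

```python
import math

def motif_probability(word, columns):
    """P(word | motif): product of the per-position base probabilities."""
    p = 1.0
    for base, column in zip(word, columns):
        p *= column[base]
    return p

def background_probability(word, background):
    """P(word | background): product of independent base probabilities."""
    p = 1.0
    for base in word:
        p *= background[base]
    return p

def mixture_log_likelihood(words, columns, background, q):
    """Log-likelihood of the words under the assumed two-component mixture."""
    total = 0.0
    for word in words:
        p_motif = motif_probability(word, columns)
        p_null = background_probability(word, background)
        total += math.log(q * p_motif + (1.0 - q) * p_null)
    return total

def best_q_on_grid(words, columns, background, steps=99):
    """Pick the q in (0, 1) that maximizes the log-likelihood on a grid."""
    grid = [(i + 1) / (steps + 1) for i in range(steps)]
    return max(
        grid,
        key=lambda q: mixture_log_likelihood(words, columns, background, q),
    )

def sharp_column(preferred):
    """Hypothetical motif column that strongly prefers one base."""
    return {b: (0.97 if b == preferred else 0.01) for b in "ACGT"}

# Toy data (made up for illustration): 3 motif-like words, 7 other words.
toy_columns = [sharp_column(b) for b in "ACG"]
toy_background = {b: 0.25 for b in "ACGT"}
toy_words = ["ACG"] * 3 + ["TTT"] * 7
toy_q = best_q_on_grid(toy_words, toy_columns, toy_background)
```

On this toy input the selected q lands near the fraction of motif-like
words, which illustrates why q behaves like a data-driven cutoff rather
than a preset one.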
3.2. Scanning the Search Space
How do we find the best motif among all possible motifs? Three strategies
can be distinguished depending on the structure of the search space,
which may consist of
(a) all k-letter words of a given alphabet;
(b) all probability matrices of length k; or
(c) all motif annotations (all subsets of k-letter subsequences of input
    sequences).
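The three search spaces differ greatly in size, which can be sketched with
a few lines of arithmetic. The function names below are hypothetical, and
two points are assumptions rather than statements from the text: each
matrix column is taken to be a probability distribution over the alphabet
(so space (b) is continuous and is described here by its number of free
parameters), and every k-letter window of every input sequence is taken as
a candidate site for space (c).

```python
def count_words(alphabet_size, k):
    """Space (a): number of distinct k-letter words."""
    return alphabet_size ** k

def matrix_free_parameters(alphabet_size, k):
    """Space (b): free parameters of a length-k probability matrix,
    assuming each column sums to 1."""
    return k * (alphabet_size - 1)

def count_windows(sequence_lengths, k):
    """Number of k-letter subsequences (windows) in the input sequences."""
    return sum(max(0, n - k + 1) for n in sequence_lengths)

def count_annotations(sequence_lengths, k):
    """Space (c): number of subsets of the k-letter subsequences."""
    return 2 ** count_windows(sequence_lengths, k)

# DNA alphabet, word length 15: 4**15 = 1,073,741,824 words.
dna_words_15 = count_words(4, 15)

# Two toy sequences of length 10 with k = 8: 3 windows each, 6 in total,
# hence 2**6 = 64 possible annotations.
toy_annotations = count_annotations([10, 10], 8)
```

Space (a) grows exponentially in k, while space (c) grows exponentially in
the amount of input sequence, which is why only (a) lends itself to plain
enumeration for short words.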
3.2.1. Finding the best consensus sequence
For consensus sequences over the four-letter DNA alphabet, the size of
the search space remains computationally manageable up to a word length
of about 15 (4^15, roughly 10^9 words). The optimal motif can thus be
found by enumeration, i.e., by evaluating a frequentist-type objective
function for each k-letter word. Algorithms to this end are reasonably
fast, as the input sequences need to be scanned only once. A word index
holding one frequency counter per k-letter word is first initialized with
zeros. Then, one word frequency is incremented each time a subsequence is
processed (if mismatches are allowed, the frequencies of all words within
the allowed mismatch distance are updated for that subsequence). For
longer words, heuristic algorithms have to be used instead of exact
methods. An old trick, introduced in the