Methods for Discovery and Characterization of DNA Sequence Motifs - Bioinformatics: A Swiss Perspective

$$P(x \mid M_0) = \prod_{i=1}^{k} p(x_i) \tag{7}$$

Note that the mixture coefficient q is part of the model and, thus, the target of optimization by the motif discovery algorithm. It plays a similar role as the threshold value of allowed mismatches to a consensus sequence in the frequentist motif evaluation framework. This raises the interesting question of whether the two approaches are equivalent with regard to defining the optimal threshold value for a consensus sequence or weight matrix. The answer is, to my knowledge, not known.

3.2. Scanning the Search Space

How do we find the best motif among all possible motifs? Three strategies can be distinguished depending on the structure of the search space, which may consist of

(a) all k-letter words of a given alphabet;
(b) all probability matrices of length k; or
(c) all motif annotations (all subsets of k-letter subsequences of input sequences).
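As a rough illustration of how differently the three search spaces scale, the sketch below computes their sizes for given inputs. The function names are hypothetical, and two details are assumptions not stated in the text: subsequences are counted as overlapping windows on one strand only, and a probability matrix is described by its number of free parameters (each column has four probabilities that sum to one), since space (b) is continuous and has no finite size.

```python
def num_words(k, alphabet_size=4):
    """Space (a): number of distinct k-letter words over the alphabet."""
    return alphabet_size ** k


def matrix_free_parameters(k, alphabet_size=4):
    """Space (b): free parameters of a length-k probability matrix.

    Each column holds alphabet_size probabilities constrained to sum to 1,
    leaving alphabet_size - 1 free values per column.
    """
    return k * (alphabet_size - 1)


def num_subsequences(sequence_lengths, k):
    """Number of overlapping k-letter windows in the input (one strand, an assumption)."""
    return sum(max(length - k + 1, 0) for length in sequence_lengths)


def num_annotations(sequence_lengths, k):
    """Space (c): number of subsets of k-letter subsequences, including the empty one."""
    return 2 ** num_subsequences(sequence_lengths, k)
```

Space (a) grows exponentially in k, whereas space (c) grows exponentially in the total input length, which is why exhaustive enumeration is only plausible for the first.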

3.2.1. Finding the best consensus sequence

For consensus sequences based on the four-letter DNA alphabet, the size of the search space remains computationally manageable up to a word length of about 15 (4^15 is roughly 10^9 words). The optimal motif can thus be found by enumeration, i.e. by evaluating a frequentist-type objective function for each k-letter word. Algorithms to this end are reasonably fast, as the input sequences need to be scanned only once. The word index is first initialized with zeros. Then, one word frequency is incremented each time a subsequence is processed (if mismatches are allowed, multiple motif frequencies are updated for one subsequence). For longer words, heuristic algorithms have to be used instead of exact methods. An old trick, introduced in the
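The single-pass enumeration described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of any particular program: the function names are hypothetical, mismatches are taken to mean substitutions only (Hamming distance), windows containing symbols other than A, C, G and T are skipped, and the raw occurrence count stands in for the frequentist-type objective function, which the text does not specify here. A dictionary holding all 4^k words is only practical for small k.

```python
from itertools import product

ALPHABET = "ACGT"  # four-letter DNA alphabet


def neighbours(word, max_mismatches):
    """Return all words within max_mismatches substitutions of word, itself included."""
    result = {word}
    frontier = {word}
    for _ in range(max_mismatches):
        next_frontier = set()
        for w in frontier:
            for i, ch in enumerate(w):
                for c in ALPHABET:
                    if c != ch:
                        next_frontier.add(w[:i] + c + w[i + 1:])
        frontier = next_frontier - result  # only expand words not seen before
        result |= next_frontier
    return result


def count_words(sequences, k, max_mismatches=0):
    """Scan the input once and return a frequency for every k-letter word."""
    # The word index is first initialised with zeros for all 4**k words.
    counts = {"".join(w): 0 for w in product(ALPHABET, repeat=k)}
    for seq in sequences:
        seq = seq.upper()
        for start in range(len(seq) - k + 1):
            window = seq[start:start + k]
            if window not in counts:  # e.g. window contains N (assumption: skip it)
                continue
            # With max_mismatches == 0 exactly one frequency is incremented;
            # otherwise every motif the window matches is updated.
            for motif in neighbours(window, max_mismatches):
                counts[motif] += 1
    return counts


def best_word(counts):
    """Pick the word with the highest count (stand-in for the real objective)."""
    return max(counts, key=counts.get)
```

For example, `count_words(["ACGTACGT", "TTACGTT"], k=4)` fills a table of 256 words in one pass, and `best_word` then selects the most frequent one. A real objective function would typically compare the observed count with its expectation under a background model rather than use the count directly.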
