Methods for Discovery and Characterization of DNA Sequence Motifs - Bioinformatics: A Swiss Perspective

Biology Reference

In-Depth Information

early 1980s, 20 is to restrict the search space to those k- letter words which

actually occur in the input sequence set. A popular implementation of the

word search strategy is provided by the program Weeder. 21

For consensus motifs based on the complete 15-letter IUPAC code,

the search space becomes too large for enumerative approaches. Progress

has recently been reported in developing efficient heuristics under certain

constraints. 22 It is debatable, however, whether such algorithms are really

needed. Consensus sequences with ambiguous symbols are still less flex-

ible than position-specific scoring matrices, and efficient and effective

algorithms are readily available for optimizing such matrices.

3.2.2. Optimizing a base probability matrix

The search space for probability matrices is continuous and, thus, poten-

tially infinite. Rigorous algorithms guaranteed to find the optimal motif

are not available or even conceivable for now. The standard heuristic

approach to optimize a probability matrix is to start from an initial motif

description referred to as a seed, and to use an iterative refinement algo-

rithm to reach a local maximum for the objective function. Expectation

maximization (EM) is the classical approach used in this context, intro-

duced to computational biology by Lawrence and Reilly. 23 The iterative

part is straightforward. Based on a current model M k , one uses Bayes'

formula to compute for each subsequence x j a weight w j , which is the

posterior probability that the subsequence constitutes a motif instance.

For the mixture model introduced above, one obtains

(

)

k

j

k

qPx

|

M

(

)

k

j

.

wPMx

=

|

=

(8)

(

) (

)

k

j

k

j

0

qPxM

|

+-

1

qPxM

|

These probabilities are then used as weights to compute a new prob-

ability matrix by adding up weighted base contributions from all

subsequences of the input sequence set:

N

(

)

Â

j

k

wxb

d

=

j

i

j

=

1

k

+

1

,

pi b

(, )

=

(9)

N

Â

k

w

j

=

1

Bioinformatics: A Swiss Perspective

Search WWH ::

Custom Search

Home