are the target of optimization. The search space thus becomes discrete
and finite, but still remains astronomically large. The classical algorithm
in this category is the Gibbs sampler [24], a stochastic variant of EM. This
approach focuses on actual motif instances rather than a generalized
description of the motif. In other words, one would primarily like to
know the locations of the functional elements in the given sequences
rather than be able to predict motif instances in new sequences. From a
theoretical and computational viewpoint, the differences are relatively
minor, since both EM and Gibbs sampling update a probability matrix.
However, whereas EM assigns k-letter subsequences probabilistically to
the two components of the mixture model, Gibbs sampling attributes each
subsequence explicitly to one class or the other. The decision whether a
given subsequence x_j is included in the motif set for the next iteration
is made randomly, based on its current posterior probability w_j, as
defined previously. The new probability matrix is typically estimated by
maximum a posteriori (MAP) estimation, for instance by adding one
pseudocount to each base frequency:
p(i, b) = \frac{1 + \sum_{j \in M_k} \delta(x_{ji} = b)}{|M_k| + 4}    (11)

where δ(x_{ji} = b) is 1 if x_{ji} = b and 0 otherwise, M_k denotes the
current set of motif instances, and |M_k| the number of elements therein.
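As a concrete illustration, the following minimal Python sketch implements one iteration of this update. The function names (map_matrix, sample_motif_set) and the representation of subsequences as strings over {A, C, G, T} are our own assumptions for illustration, not part of the original algorithm description:

```python
import random

BASES = "ACGT"

def map_matrix(motifs, k):
    """MAP estimate of the probability matrix from the current motif set
    M_k, adding one pseudocount per base as in Eq. (11):
    p(i, b) = (1 + count of base b at position i) / (|M_k| + 4)."""
    n = len(motifs)
    counts = [{b: 1 for b in BASES} for _ in range(k)]  # pseudocounts
    for x in motifs:
        for i in range(k):
            counts[i][x[i]] += 1
    return [{b: counts[i][b] / (n + 4) for b in BASES} for i in range(k)]

def sample_motif_set(subsequences, posteriors):
    """Gibbs-style stochastic assignment: include x_j in the next motif
    set with probability w_j (its current posterior), instead of the
    fractional weighting used by EM."""
    return [x for x, w in zip(subsequences, posteriors)
            if random.random() < w]

# Example: sample a new motif set, then re-estimate the matrix from it.
motifs = sample_motif_set(["ACGT", "ACGA", "TCGT"], [0.9, 0.7, 0.2])
p = map_matrix(motifs, 4)
```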
The fact that Gibbs sampling has an in-built stochastic element
helps to overcome the risk of getting trapped in a local optimum far
away from the global optimum. Another advantage of a nondeterministic
algorithm is that it can return different results when run several
times from the same starting point, which also increases its chances of
finding the globally optimal solution. Optimization in motif annotation
space can also be done in a deterministic fashion. For instance, one
could define the new motif instances by accepting all k-letter
subsequences with posterior probabilities higher than 0.5. This leads to an
iterative, multiple alignment algorithm, similar to the PATOP algorithm
described in Sec. 5.
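For comparison, the deterministic variant replaces the random draw with a fixed cutoff. A minimal sketch, again with a hypothetical function name and assuming the same posteriors w_j as above:

```python
def threshold_motif_set(subsequences, posteriors, cutoff=0.5):
    """Deterministic update: accept every k-letter subsequence whose
    posterior probability exceeds the cutoff (0.5 in the text), so the
    same input always yields the same motif set."""
    return [x for x, w in zip(subsequences, posteriors) if w > cutoff]
```

Because no random draw is involved, repeated runs from the same starting point converge identically; this reproducibility is exactly what the stochastic variant trades away in exchange for its ability to escape local optima.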