Biology Reference
In-Depth Information
early 1980s, 20 is to restrict the search space to those k- letter words which
actually occur in the input sequence set. A popular implementation of the
word search strategy is provided by the program Weeder. 21
For consensus motifs based on the complete 15-letter IUPAC code,
the search space becomes too large for enumerative approaches. Progress
has recently been reported in developing efficient heuristics under certain
constraints. 22 It is debatable, however, whether such algorithms are really
needed. Consensus sequences with ambiguous symbols are still less flex-
ible than position-specific scoring matrices, and efficient and effective
algorithms are readily available for optimizing such matrices.
3.2.2. Optimizing a base probability matrix
The search space for probability matrices is continuous and, thus, poten-
tially infinite. Rigorous algorithms guaranteed to find the optimal motif
are not available or even conceivable for now. The standard heuristic
approach to optimize a probability matrix is to start from an initial motif
description referred to as a seed, and to use an iterative refinement algo-
rithm to reach a local maximum for the objective function. Expectation
maximization (EM) is the classical approach used in this context, intro-
duced to computational biology by Lawrence and Reilly. 23 The iterative
part is straightforward. Based on a current model M k , one uses Bayes'
formula to compute for each subsequence x j a weight w j , which is the
posterior probability that the subsequence constitutes a motif instance.
For the mixture model introduced above, one obtains
(
)
k
j
k
qPx
|
M
(
)
k
k
j
.
wPMx
=
|
=
(8)
(
) (
) (
)
k
j
k
k
j
0
qPxM
|
+-
1
qPxM
|
These probabilities are then used as weights to compute a new prob-
ability matrix by adding up weighted base contributions from all
subsequences of the input sequence set:
N
(
)
Â
j
k
wxb
d
=
j
i
j
=
1
k
+
1
,
pi b
(, )
=
(9)
N
Â
k
w
j
j
=
1
 
Search WWH ::




Custom Search