match to a motif is expected to occur by chance in the input sequence
sets that were used for benchmarking. Under these conditions, it is
impossible in principle to infer the true motif instances with high
reliability. Since the probability matrices returned by motif discovery
algorithms are derived from hypothetical motif instances in the input
set, their quality is compromised by the contamination with false
matches.
Conversely, the motifs corresponding to protein domains have a
much higher complexity, often in the range of 30 bits or more. The
higher information content is due to the increased length (up to 100
amino acids) and the larger size of the protein alphabet (20 instead of 4).
Motifs with this degree of complexity are unlikely to occur by chance
in a protein sequence of average length, and thus can be located with
near certainty. The higher complexity also explains why protein domain
discovery has often been initiated by a single statistically significant
pairwise sequence match retrieved by database search.
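The contrast between the two regimes can be illustrated with a rough calculation. The sketch below assumes the common approximation that a motif with an information content of I bits matches a random position with probability of about 2^-I, and that positions are independent. The function name and the example sizes (a 10-bit DNA motif searched on both strands of ten 1000 bp sequences, a 30-bit protein motif in a 400-residue sequence) are illustrative assumptions, not values taken from the benchmarks discussed here.

```python
def expected_chance_matches(info_bits: float, n_positions: int) -> float:
    """Approximate expected number of chance matches to a motif.

    Assumes each position matches independently with probability
    2 ** -info_bits (a simplification; real sequences are not i.i.d.).
    """
    return n_positions * 2.0 ** (-info_bits)


# Hypothetical DNA case: 10-bit motif, 10 sequences x 1000 bp, both strands.
dna = expected_chance_matches(10, 10 * 1000 * 2)

# Hypothetical protein case: 30-bit motif, one 400-residue sequence.
protein = expected_chance_matches(30, 400)

print(f"DNA motif:     {dna:.1f} expected chance matches")
print(f"protein motif: {protein:.2e} expected chance matches")
```

Under these assumptions the DNA motif is expected to match by chance about 20 times in the input set, whereas the protein motif is expected to match by chance less than once in a million sequences of that length.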
4.3. Reasons for the Limited Success of DNA Motif Discovery
Based on the above considerations, I doubt that motif discovery
is rightly considered to be a tough problem. The poor benchmarking
results reported in recent papers are perhaps mostly due to the
inadequacy of the input data sets. Shorter sequences would be needed
to localize motifs with high confidence, and a larger number of motif
instances would be required to obtain reliable base frequency estimates.
Failure by the heuristic algorithms to find the optimal motif is unlikely
to be a major reason for the poor benchmarking results. This could be
tested by comparing the true versus predicted motif annotations in
terms of the objective function used by the motif discovery program.
My conjecture is that the true motif annotation will score worse than
the predicted one in such a test. The fact that the consensus
sequence-based motif search program Weeder [21] showed the best
performance in the above-described tests supports this hypothesis.
In an overfitting situation due to sparse data, methods based on a
simpler model with fewer degrees of freedom tend to perform better.
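The proposed test, scoring the true and the predicted annotations with the program's own objective function, could be sketched as follows. The actual objective differs from program to program; this sketch substitutes a generic one (column-wise relative entropy against a uniform background, scaled by the number of sites) as a stand-in. The function name, the pseudocount, and both site lists are made up for illustration and are not benchmark data.

```python
import math
from collections import Counter

ALPHABET = "ACGT"


def alignment_score(sites, pseudocount=0.5):
    """Relative entropy (bits) of aligned sites vs. a uniform background,
    summed over columns and multiplied by the number of sites.

    A stand-in objective; real motif discovery programs use their own.
    """
    n = len(sites)
    width = len(sites[0])
    background = 1.0 / len(ALPHABET)
    total = 0.0
    for col in range(width):
        counts = Counter(site[col] for site in sites)
        for base in ALPHABET:
            # Pseudocounts keep every frequency strictly positive.
            p = (counts[base] + pseudocount) / (n + pseudocount * len(ALPHABET))
            total += p * math.log2(p / background)
    return n * total


# Toy data: the "true" sites vary, the "predicted" sites are over-sharpened.
true_sites = ["TGACTCA", "TGAGTCA", "TTACTCA", "TGACTAA"]
predicted_sites = ["TGACTCA", "TGACTCA", "TGACTCA", "TGACTCA"]

true_score = alignment_score(true_sites)
predicted_score = alignment_score(predicted_sites)
print(f"true annotation:      {true_score:.2f} bits")
print(f"predicted annotation: {predicted_score:.2f} bits")
```

If the predicted annotation scores at least as well as the true one, as it does by construction in this toy case, the search heuristic has not missed a better optimum. The limitation would then lie in the data and the model rather than in the algorithm.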