expected for fortuitous motifs. Shuffling methods that preserve the
higher-order Markov chain properties 27 of the real sequences are
recommended for this purpose.
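As a rough illustration of what preserving Markov chain properties means, the sketch below fits an order-k Markov model to the real sequences and samples a control sequence of matching length from it. This is a simpler, model-based substitute for a true shuffling procedure: it reproduces the (k+1)-mer statistics only on average, not exactly, and it is not the method of reference 27. The function names fit_markov and sample_control, the fallback to pooled base frequencies, and the toy sequences are assumptions made for this example.

```python
import random
from collections import Counter, defaultdict


def fit_markov(seqs, k):
    """Count, for every k-mer context, which base follows it in the real sequences."""
    counts = defaultdict(Counter)   # context (k-mer) -> Counter of next bases
    background = Counter()          # pooled base counts, used as a fallback
    for s in seqs:
        background.update(s)
        for i in range(len(s) - k):
            counts[s[i:i + k]][s[i + k]] += 1
    return counts, background


def sample_control(seq, counts, background, k, rng):
    """Sample a control sequence of len(seq), seeded with the first k bases of seq."""
    out = list(seq[:k])
    while len(out) < len(seq):
        ctx = "".join(out[-k:]) if k else ""
        nxt = counts.get(ctx)
        if not nxt:
            # Assumption: a context seen only at a sequence end has no successor,
            # so fall back to the pooled base frequencies.
            nxt = background
        bases, weights = zip(*sorted(nxt.items()))
        out.append(rng.choices(bases, weights=weights)[0])
    return "".join(out)


if __name__ == "__main__":
    real = ["ACGTTGCAACGTAGGCTA", "TTGACGTAGCTAGGCATC"]  # toy sequences, not real data
    k = 2                                                # order of the Markov model
    counts, background = fit_markov(real, k)
    rng = random.Random(0)                               # fixed seed for repeatability
    for s in real:
        print(sample_control(s, counts, background, k, rng))
```

Running a motif discovery program on many such control sets gives an empirical distribution of scores for fortuitous motifs. A larger k preserves more of the local sequence composition but needs more sequence data to estimate the transition counts reliably.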
4. Bottlenecks and Limitations
DNA motif discovery is considered to be a hard problem. This is
surprising given the apparently simple structure of DNA motifs compared
to protein sequence motifs. The perception of difficulty rests partly on
the poor track record of the approach: few important discoveries made by
DNA motif discovery were later confirmed by experimental follow-up
studies. This contrasts with the great success of similar methods in
discovering new protein sequence motifs. 28 Recent evaluation studies
based on representative and realistic benchmark sequence sets indeed
confirmed that current state-of-the-art motif discovery programs are
highly ineffective at rediscovering experimentally characterized motif
instances (transcription factor binding sites) hidden in gene regulatory
sequences. 29,30 However, these results have to be interpreted with
caution. Let us first look at the benchmarking procedure.
4.1. Benchmarking Procedures for Motif Discovery
Claims about the poor performance of motif discovery algorithms require
that the community agree on how performance is measured. In this sense,
the recent benchmarking papers have made an invaluable contribution to
structuring the field by better defining the problem. It is thus
essential for the newcomer to understand how these tests were set up.
The procedure is schematized in Fig. 3. The benchmark sets consist of
DNA sequences a few hundred base pairs in length that contain annotated
transcription factor binding sites, which constitute the motifs to be
discovered. The experimental motif annotations of the eukaryotic
benchmarking set were taken from TRANSFAC, 31 while the prokaryotic test
set is based on RegulonDB. 32 Both resources are manually curated
databases that rely on experimental results published in journal articles.
One test per motif is carried out. An input sequence set consists
of all sequences containing a particular motif. The experimental