Methods for Discovery and Characterization of DNA Sequence Motifs - Bioinformatics: A Swiss Perspective


3.2.4. Finding multiple motifs

In exploratory applications, such as mining promoter sequences for new

transcription regulatory motifs, one often expects to find more than one

motif. For instance, a landmark paper on Drosophila promoters 25 reported

10 ab initio discovered motifs returned in one program run by MEME.

Fortunately, there is a simple and efficient way to extend the basic

algorithms presented above to multiple motif discovery. The principle is to

proceed iteratively by searching for one motif at a time, and by progressively

excluding motif instances found from subsequent iterations. More formally,

this means that, after each cycle, the k-letter subsequences attributed to the

newly discovered motif are removed from the search space — a process that

is commonly referred to as “masking” in the sequence analysis literature.

A theoretically more proper approach would use multi-component mixture

models for synchronous optimization of several motifs at a time by EM,

Gibbs, or a progressive local multiple alignment algorithm.
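
The iterative masking scheme described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the single-motif step is a toy stand-in that picks the most frequent exact k-letter word, not the EM or Gibbs optimization of a real program such as MEME, and the names `most_frequent_kmer`, `mask_instances`, `find_motifs` and the mask character `N` are hypothetical.

```python
from collections import Counter

MASK = "N"  # assumed placeholder character for masked positions


def most_frequent_kmer(seqs, k):
    """Toy single-motif finder (an assumption, not MEME's method).

    Returns (word, count) for the most frequent k-letter word that
    contains no masked position, or None if no such word is left.
    """
    counts = Counter()
    for s in seqs:
        for i in range(len(s) - k + 1):
            word = s[i:i + k]
            if MASK not in word:
                counts[word] += 1
    if not counts:
        return None
    # Highest count first; ties broken alphabetically for determinism.
    best = min(counts, key=lambda w: (-counts[w], w))
    return best, counts[best]


def mask_instances(seqs, word):
    """Overwrite every occurrence of `word`, overlapping ones included."""
    k = len(word)
    masked = []
    for s in seqs:
        chars = list(s)
        start = s.find(word)
        while start != -1:
            chars[start:start + k] = MASK * k
            start = s.find(word, start + 1)
        masked.append("".join(chars))
    return masked


def find_motifs(seqs, k, n_motifs):
    """Search for one motif per cycle, masking its instances afterwards."""
    found = []
    work = list(seqs)
    for _ in range(n_motifs):
        hit = most_frequent_kmer(work, k)
        if hit is None:  # search space exhausted
            break
        word, count = hit
        found.append((word, count))
        work = mask_instances(work, word)  # exclude from later cycles
    return found
```

With the toy input `["AAAACCCC", "AAAAGGGG", "AAAACCCC"]` and k = 4, two cycles of `find_motifs` return `AAAA` (3 occurrences) and then `CCCC` (2 occurrences). The overlapping words `AAAC`, `AACC` and `ACCC` can no longer be reported once the first motif has been masked, which is the purpose of the masking step.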

3.2.5. Estimating the significance of a newly discovered motif

The different types of probability values used as objective functions for

motif optimization do not provide an answer to the question

of whether the best motif found is significant or not, as they apply

to single motifs and thus are not corrected for multiple tests. With

consensus sequence motifs, a Bonferroni correction is sometimes applied;

see, for instance, Xie et al. 26 However, this approach is likely to yield

overly conservative P-value estimates, as consensus word frequencies are

highly dependent on each other, especially if mismatches are tolerated.
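
As a rough illustration of the Bonferroni correction mentioned above, the sketch below multiplies a single-motif P-value by the number of tests and caps the result at 1. Taking the number of tests to be the number of distinct k-letter words (4^k for DNA) is an assumption made for this example; it is not necessarily the count used by Xie et al., and the function names are hypothetical.

```python
def n_consensus_words(k, alphabet_size=4):
    """Number of distinct k-letter words, used here as the test count (assumed)."""
    return alphabet_size ** k


def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-corrected P-value: p times the number of tests, capped at 1."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return min(1.0, p_value * n_tests)


# Example: an 8-letter consensus word with an uncorrected P-value of 1e-6.
n_tests = n_consensus_words(8)               # 4**8 = 65536 candidate words
adjusted = bonferroni_adjust(1e-6, n_tests)  # about 0.0655
```

Because the correction treats all candidate words as independent tests, it overstates the effective number of tests when word frequencies are correlated, which is why the resulting estimates tend to be conservative.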

The program MEME provides significance estimates for matrix-based

motif models based on a maximum likelihood ratio test (LRT), which

takes into account the number of free parameters of the model. 16 This

approach is quite sensitive to the properties of the null model, and in

practice tends to assign low E-values to questionable motifs. A good way

to corroborate the significance of a newly found motif is to rerun

the motif discovery program with randomized or shuffled sequences as

a control, so as to get an idea of what P-values or E-values could be
