Methods for Discovery and Characterization of DNA Sequence Motifs - Bioinformatics: A Swiss Perspective


Today, one also has access to large promoter sets defined by high-throughput methods such as expressed sequence tag (EST) sequencing of oligo-capped cDNAs [38] and CAGE [39]. However, in promoter analysis, one still faces the problem that many motifs may only be weakly overrepresented and located at a highly variable distance from the TSS. An enrichment of motifs in genomic sequences can be achieved in various ways. For instance, the motif search could be targeted at specific regulatory motifs by restricting the input sequences to promoters of genes that are regulated in a particular way; see, for instance, Roth et al. [40]. Likewise, genome-wide chromatin immunoprecipitation profiles for more than 200 transcription factors have been used to select yeast promoters that are occupied in vivo by a given factor in order to define its cognate binding site motifs with six different motif discovery methods [41]. A presumably very accurate binding site weight matrix has been derived from over 13 000 in vivo mapped sites for the insulator protein CTCF [42]. Another currently very trendy approach, exemplified in Xie et al. [26], is to restrict the motif search to sequence regions that are conserved across genomes. Specialized motif discovery algorithms, such as PhyloGibbs [43], can exploit information about phylogenetic conservation contained in a multiple sequence alignment given as input to the method.
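
As a rough illustration of how restricting the input to a selected promoter subset can expose a weakly overrepresented motif, the sketch below compares motif occurrence in a subset against the full promoter set with a one-sided hypergeometric test. The function names, the toy sequences, and the choice of the hypergeometric test are assumptions made for illustration; they are not taken from the studies cited above, and exact string matching on one strand stands in for a real motif scanner.

```python
from math import comb


def count_hits(sequences, motif):
    """Count sequences containing the motif at least once (forward strand, exact match)."""
    return sum(1 for seq in sequences if motif in seq.upper())


def enrichment_pvalue(total, total_hits, selected, selected_hits):
    """Upper-tail hypergeometric probability P(X >= selected_hits).

    total         -- number of promoters in the full set
    total_hits    -- promoters in the full set that contain the motif
    selected      -- size of the selected subset (drawn from the full set)
    selected_hits -- promoters in the subset that contain the motif
    """
    upper = min(total_hits, selected)
    favourable = sum(
        comb(total_hits, i) * comb(total - total_hits, selected - i)
        for i in range(selected_hits, upper + 1)
    )
    return favourable / comb(total, selected)


# Toy data (hypothetical): ten promoters, the first three form the selected subset.
all_promoters = [
    "GCTATAATGC", "TTTATAATCG", "CATATAATGG", "GCGCGCGCGC", "ACGTACGTAC",
    "GGGCCCGGGC", "CCATGGCCAT", "AGCTAGCTAG", "TGCATGCATG", "GATCGATCGA",
]
selected_promoters = all_promoters[:3]

p_value = enrichment_pvalue(
    len(all_promoters),
    count_hits(all_promoters, "TATAAT"),
    len(selected_promoters),
    count_hits(selected_promoters, "TATAAT"),
)
print(p_value)
```

With real data, the subset would be chosen by an independent criterion such as co-regulation or in vivo occupancy, and the test would typically be repeated over many candidate motifs, which calls for a multiple-testing correction.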

The classical weight matrix model, which assumes that motifs are of fixed length and that the contributions of individual bases at different positions are additive and independent of each other, is likely to be an oversimplification in most cases (see Benos et al. [44] for a critical discussion of this issue). The weight array method takes nearest-neighbor dependencies into account by scoring overlapping dinucleotides rather than individual bases [45]. Target motifs recognized by multimeric DNA-binding proteins often consist of two or more short motifs separated by spacers of slightly variable length. In fact, the Pribnow box shown in Fig. 1 is one of two conserved motifs characteristic of the major class of E. coli promoters. The classical EM algorithm for motif optimization is readily extensible to such bipartite, variable-length motif structures [46]. Ab initio discovery of composite motifs (also referred to as transcription regulatory modules), including combinations of motifs that may occur in different order and orientation, is another currently very active research direction; see Kel et al. [47] for an example.
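
The differences between these models can be made concrete with a small sketch. It builds a log-odds weight matrix and a dinucleotide-based weight array from aligned sites, then scans a sequence for a bipartite motif with a variable spacer. Everything here is a simplifying assumption for illustration: the function names, the uniform background of 0.25 per base, the pseudocount of 1, the toy sites, and the spacer range are hypothetical, and the bipartite scan is a plain exhaustive search rather than the EM procedure of ref. 46.

```python
from collections import Counter
from math import log2

BASES = "ACGT"
BACKGROUND = 0.25  # assumed uniform base composition


def weight_matrix(sites, pseudo=1.0):
    """Per-position log-odds scores; positions are treated as independent."""
    total = len(sites) + 4 * pseudo
    matrix = []
    for i in range(len(sites[0])):
        counts = Counter(site[i] for site in sites)
        matrix.append({b: log2((counts[b] + pseudo) / total / BACKGROUND) for b in BASES})
    return matrix


def weight_array(sites, pseudo=1.0):
    """Log-odds of each base given its left neighbour, for positions 1..L-1."""
    array = []
    for i in range(1, len(sites[0])):
        pairs = Counter((site[i - 1], site[i]) for site in sites)
        prevs = Counter(site[i - 1] for site in sites)
        column = {}
        for prev in BASES:
            total = prevs[prev] + 4 * pseudo
            for cur in BASES:
                column[(prev, cur)] = log2((pairs[(prev, cur)] + pseudo) / total / BACKGROUND)
        array.append(column)
    return array


def score_matrix(matrix, window):
    """Additive score of a window that has the same length as the matrix."""
    return sum(column[base] for column, base in zip(matrix, window))


def score_array(matrix, array, window):
    """First base scored by the matrix, every later base by its dinucleotide."""
    score = matrix[0][window[0]]
    for i, column in enumerate(array, start=1):
        score += column[(window[i - 1], window[i])]
    return score


def best_bipartite_score(left, right, sequence, min_gap, max_gap):
    """Best (score, start, gap) for two half-site matrices separated by a spacer."""
    best = None
    for start in range(len(sequence)):
        for gap in range(min_gap, max_gap + 1):
            right_start = start + len(left) + gap
            if right_start + len(right) > len(sequence):
                break  # longer gaps cannot fit either
            total = (score_matrix(left, sequence[start:start + len(left)])
                     + score_matrix(right, sequence[right_start:right_start + len(right)]))
            if best is None or total > best[0]:
                best = (total, start, gap)
    return best


# Toy sites (hypothetical) in which positions 2 and 3 co-vary: either "AT" or "GC".
sites = ["AATT", "AGCT", "AATT", "AGCT"]
wm = weight_matrix(sites)
wa = weight_array(sites)
# The weight matrix cannot tell the observed "AATT" from the unobserved "AACT";
# the weight array scores the unobserved dinucleotide "AC" lower.
print(score_matrix(wm, "AATT"), score_matrix(wm, "AACT"))
print(score_array(wm, wa, "AATT"), score_array(wm, wa, "AACT"))

# Bipartite scan with two toy half-site matrices and a 15-19 base spacer.
left_wm = weight_matrix(["TTGACA", "TTGACA", "TTTACA"])
right_wm = weight_matrix(["TATAAT", "TATAAT", "TATGAT"])
sequence = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
best = best_bipartite_score(left_wm, right_wm, sequence, 15, 19)
print(best)
```

The sketch expects uppercase sequences containing only A, C, G, and T, and it scans a single strand. Treating the spacer as a free choice within a fixed range is the simplest option; a probabilistic model would instead assign each spacer length its own weight.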
