Today, one also has access to large promoter sets defined by
high-throughput methods such as expressed sequence tag (EST) sequencing
of oligo-capped cDNAs [38] and CAGE [39].
However, in promoter analysis, one still faces the problem that many
motifs may be only weakly overrepresented and located at a highly
variable distance from the TSS. An enrichment of motifs in genomic
sequences can be achieved in various ways. For instance, the motif
search could be targeted at specific regulatory motifs by restricting
the input sequences to promoters of genes that are regulated in a
particular way, as in Roth et al. [40].
Likewise, genome-wide chromatin immunoprecipitation profiles for more
than 200 transcription factors have been used to select yeast promoters
that are occupied in vivo by a given factor in order to define its
cognate binding site motifs with six different motif discovery
methods [41]. A presumably very accurate binding site weight matrix has
been derived from over 13 000 in vivo mapped sites for the insulator
protein CTCF [42].
Another currently popular approach, exemplified in Xie et al. [26], is
to restrict the motif search to sequence regions that are conserved
across genomes. Specialized motif discovery algorithms, such as
PhyloGibbs [43], can exploit information about phylogenetic conservation
contained in a multiple sequence alignment given as input to the method.
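The idea of weak overrepresentation in a selected promoter set can be made concrete with a small sketch. It compares how often a fixed word occurs in a selected subset of promoters against the full promoter collection, and attaches a hypergeometric tail probability. The function names, the toy sequences, and the choice of an exact-word match with a hypergeometric test are all assumptions made for illustration; they are not taken from the studies cited above.

```python
from math import comb

def count_hits(seqs, motif):
    """Count sequences that contain the motif at least once."""
    return sum(motif in s for s in seqs)

def hypergeom_tail(k, n, K, N):
    """P(X >= k) for n draws without replacement from N items, K of them hits."""
    return sum(comb(K, i) * comb(N - K, n - i)
               for i in range(k, min(n, K) + 1)) / comb(N, n)

def motif_enrichment(selected, all_promoters, motif):
    """Return (fold enrichment, p-value) of a motif in the selected subset."""
    n, N = len(selected), len(all_promoters)
    k, K = count_hits(selected, motif), count_hits(all_promoters, motif)
    if K == 0:
        return float("nan"), 1.0  # motif absent everywhere: nothing to test
    fold = (k / n) / (K / N)
    return fold, hypergeom_tail(k, n, K, N)

# Toy data (hypothetical sequences, for illustration only).
selected = ["GGTATAATCC", "ACTATAATGA", "TTGCGCGCAA"]
others = ["GCGCGCGCGC", "ATATATGCGC", "CCCCGGGGAA", "TTTTGGGGCC", "GATATAATCG"]
fold, p_value = motif_enrichment(selected, selected + others, "TATAAT")
print(f"fold enrichment = {fold:.2f}, p = {p_value:.3f}")
```

When many candidate words are scanned in this way, the resulting p-values would need a multiple-testing correction before any single word is called enriched.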
The classical weight matrix model, which assumes that motifs are of
fixed length and that the contributions of individual bases at different
positions are additive and independent of each other, is likely to be an
oversimplification in most cases (see Benos et al. [44] for a critical
discussion of this issue). The weight array method takes nearest-neighbor
dependencies into account by scoring overlapping dinucleotides rather
than individual bases [45].
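The difference between the two models can be illustrated with a minimal sketch. The weight matrix sums one log-odds term per position, whereas the dinucleotide variant shown here scores each base conditional on its left neighbor. The pseudocount of 0.5, the uniform background of 0.25, the function names, and the toy binding sites are assumptions made for this example; the formulation is one common way to write a weight array model and may differ in detail from the cited method.

```python
from collections import Counter
from math import log2

BASES = "ACGT"
PSEUDO = 0.5   # assumed pseudocount
BG = 0.25      # assumed uniform background frequency

def train_pwm(sites):
    """One column of log-odds weights per position (independent positions)."""
    pwm = []
    for i in range(len(sites[0])):
        counts = Counter(s[i] for s in sites)
        total = len(sites) + 4 * PSEUDO
        pwm.append({b: log2((counts[b] + PSEUDO) / total / BG) for b in BASES})
    return pwm

def score_pwm(pwm, seq):
    """Additive score: sum of the per-position weights."""
    return sum(col[b] for col, b in zip(pwm, seq))

def train_wam(sites):
    """Log-odds of each base given its left neighbor, for positions 2..L."""
    wam = []
    for i in range(1, len(sites[0])):
        pairs = Counter((s[i - 1], s[i]) for s in sites)
        prevs = Counter(s[i - 1] for s in sites)
        table = {}
        for prev in BASES:
            total = prevs[prev] + 4 * PSEUDO
            for cur in BASES:
                table[(prev, cur)] = log2((pairs[(prev, cur)] + PSEUDO) / total / BG)
        wam.append(table)
    return wam

def score_wam(pwm, wam, seq):
    """First base scored by the PWM column, the rest by overlapping dinucleotides."""
    score = pwm[0][seq[0]]
    for i in range(1, len(seq)):
        score += wam[i - 1][(seq[i - 1], seq[i])]
    return score

# Toy sites (hypothetical): the last base depends on the middle one.
sites = ["TAA", "TAA", "TCG", "TCG"]
pwm, wam = train_pwm(sites), train_wam(sites)
for candidate in ("TAA", "TAG"):
    print(candidate, round(score_pwm(pwm, candidate), 2),
          round(score_wam(pwm, wam, candidate), 2))
```

In this toy set the weight matrix cannot tell TAA from TAG, because A and G are equally frequent at the last position, while the dinucleotide model penalizes TAG since G was never observed after A.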
Target motifs recognized by multimeric DNA-binding proteins often
consist of two or more short motifs separated by spacers of slightly
variable length. In fact, the Pribnow box shown in Fig. 1 is one of two
conserved motifs characteristic of the major class of E. coli promoters.
The classical EM algorithm for motif optimization is readily extensible
to such bipartite, variable-length motif structures [46].
Ab initio discovery of composite motifs (also referred to as
transcription regulatory modules), including combinations of motifs that
may occur in different order and orientation, is another currently very
active research direction; see Kel et al. [47] for an example.
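The bipartite motif structure described above, two short boxes separated by a spacer of slightly variable length, can be sketched as a simple scan. This is not the EM procedure mentioned in the text; it only shows how a fixed two-box pattern with a flexible gap might be matched. The consensus strings TTGACA and TATAAT, the spacer range of 16 to 18 bases, the mismatch limit, and the function names are assumptions chosen for illustration.

```python
def mismatches(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

def scan_bipartite(seq, left, right, spacers=(16, 17, 18), max_mismatch=1):
    """Return (start, spacer, mismatches) for each bipartite match in seq.

    A match is a window resembling `left`, followed by a gap whose length
    is one of `spacers`, followed by a window resembling `right`. The
    mismatch limit applies to both boxes combined.
    """
    hits = []
    for start in range(len(seq) - len(left) + 1):
        mm_left = mismatches(seq[start:start + len(left)], left)
        if mm_left > max_mismatch:
            continue
        for gap in spacers:
            r_start = start + len(left) + gap
            r_end = r_start + len(right)
            if r_end > len(seq):
                continue  # right box would run past the end of the sequence
            total = mm_left + mismatches(seq[r_start:r_end], right)
            if total <= max_mismatch:
                hits.append((start, gap, total))
    return hits

# Toy sequence (hypothetical): both boxes separated by a 17-base spacer.
toy = "GG" + "TTGACA" + "C" * 17 + "TATAAT" + "GG"
print(scan_bipartite(toy, "TTGACA", "TATAAT", max_mismatch=0))
```

A probabilistic treatment would replace the consensus strings with two weight matrices and put a distribution over the spacer length, which is the kind of model the EM extension fits.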