one at a time, in an alternating fashion. The center position and width of
the preferred region, as well as the cut-off value, are optimized in an enu-
merative fashion based on the previously introduced objective function.
User-specified upper and lower bounds and increment values define the
search space of all combinations of these three parameters. The weight
matrix is updated by counting the base frequencies in the current set of
putative motif instances, which is unambiguously defined by the current
ensemble of motif parameters. The base frequencies are converted into
log-likelihood weights, as detailed in Sec. 2. A powerful additional fea-
ture of PATOP is that it can shrink or extend the length of the weight
matrix on each iteration. This is achieved by including a number of addi-
tional, adjacent positions in the base frequency matrix compiled from the
current motif instances. The new limits of the matrix are then defined on
the basis of the observed skew of base composition in a matrix column,
evaluated by a χ-squared test.
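
The matrix update described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not PATOP's actual code: the function names are hypothetical, the log-likelihood weights are taken to be log2 ratios of pseudocount-smoothed column frequencies to a uniform background (the exact formula of Sec. 2 is not reproduced here), and the new matrix limits are assumed to run from the first to the last column whose χ-squared statistic exceeds the 5% critical value for 3 degrees of freedom.

```python
import math
from collections import Counter

BASES = "ACGT"

def count_matrix(instances):
    """Per-column base counts for equal-length motif instances."""
    length = len(instances[0])
    return [Counter(seq[i] for seq in instances) for i in range(length)]

def log_odds_weights(counts, background, pseudocount=1.0):
    """Assumed weight formula: log2(smoothed frequency / background)."""
    weights = []
    for col in counts:
        total = sum(col[b] for b in BASES) + pseudocount * len(BASES)
        weights.append({
            b: math.log2(((col[b] + pseudocount) / total) / background[b])
            for b in BASES
        })
    return weights

def chi_squared(col, background):
    """Chi-squared statistic of one column against the background."""
    n = sum(col[b] for b in BASES)
    return sum((col[b] - n * background[b]) ** 2 / (n * background[b])
               for b in BASES)

def trim_limits(counts, background, critical=7.815):
    """Assumed rule: keep the span between the first and last skewed column.

    7.815 is the chi-squared critical value for 3 degrees of freedom at p = 0.05.
    Returns (start, end) with end exclusive, or None if no column is skewed.
    """
    skewed = [i for i, col in enumerate(counts)
              if chi_squared(col, background) > critical]
    if not skewed:
        return None
    return skewed[0], skewed[-1] + 1

# Toy data: a TATAAA core with one unskewed flanking position on each side.
background = {b: 0.25 for b in BASES}
instances = ["GTATAAAC", "CTATAAAG", "ATATAAAT", "TTATAAAA"]
counts = count_matrix(instances)
limits = trim_limits(counts, background)
weights = log_odds_weights(counts, background)
```

With this toy input the flanking columns show no skew, so `limits` covers only the six core positions. Four instances are far too few for a reliable χ-squared test; a real run would use the hundreds or thousands of instances collected in each iteration.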
Figure 6 illustrates the effect of the refinement by PATOP with the
eukaryotic TATA box as an example. The initial motif consists of the con-
sensus sequence TATAAA with two mismatches allowed. The length of
the final matrix is 11 base pairs. The DNA sequences and TSS positions
used for refinement correspond to the human promoter subset of the
Eukaryotic Promoter Database (EPD), release 93 [36], 1867 sequences in
total. The plot shows the motif frequencies, evaluated in overlapping
windows of width 8 and 20, respectively. In this example, the gain in
local overrepresentation results from an increase in the peak signal fre-
quency and from a decrease in the background frequency. In other
words, the resulting optimized weight matrix has both higher sensitivity
and higher specificity than the input consensus sequence.
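
The kind of positional frequency profile plotted in such a figure can be approximated as follows. This is a minimal sketch under assumptions: the sequences are taken to be of equal length and already aligned on the TSS, "motif frequency" is read as the fraction of sequences with at least one match starting inside a window, and the helper names and toy sequences are invented for illustration. Only the consensus TATAAA and the two allowed mismatches come from the text.

```python
def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def match_positions(seq, consensus, max_mismatches):
    """Start positions where seq matches the consensus within the mismatch limit."""
    k = len(consensus)
    return [i for i in range(len(seq) - k + 1)
            if hamming(seq[i:i + k], consensus) <= max_mismatches]

def windowed_frequency(seqs, consensus, max_mismatches, width, step=1):
    """Fraction of sequences with a match starting in each overlapping window.

    Returns a list of (window_start, frequency) pairs.
    """
    length = len(seqs[0])
    hits = [set(match_positions(s, consensus, max_mismatches)) for s in seqs]
    profile = []
    for start in range(0, length - width + 1, step):
        window = range(start, start + width)
        n = sum(1 for h in hits if any(p in h for p in window))
        profile.append((start, n / len(seqs)))
    return profile

# Toy sequences (hypothetical), all 18 bases long.
seqs = [
    "GGGGTATAAAGGGGGGGG",
    "CCCCTATATACCCCCCCC",
    "ACGTACGTACGTACGTAC",
]
# Window width 4 is chosen only to suit the short toy sequences.
profile = windowed_frequency(seqs, "TATAAA", max_mismatches=2, width=4)
peak_start, peak_freq = max(profile, key=lambda item: item[1])
```

Comparing such a profile for the initial consensus with one computed from the refined weight matrix and its cut-off would show whether the peak rises and the background falls, which is the gain in local overrepresentation discussed above.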
6. Conclusions and Perspectives
The success of motif discovery depends, to a large extent, on the suit-
ability of the input data. Ideally, the data set should consist of a large
number of short sequences highly enriched in one particular motif, but
otherwise random. In certain application areas, such data sets are available.
For instance, the SAGE/SELEX technique [37] can produce thousands of
short sequences that bind to a particular transcription factor in vitro.