As mentioned before, the results of these studies were disappointing.
With the eukaryotic benchmarking set, sensitivity and specificity varied
between 10% and 30% [30], and only marginally better results were
obtained with the prokaryotic test set [29]. Similar results were obtained
with synthetic sequences, where the experimentally defined motif
instances were hidden in computer-generated random sequences
corresponding to a Markov chain model.
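The construction of such a synthetic test sequence can be illustrated with a minimal sketch. It assumes a first-order Markov chain background with hand-picked transition probabilities and a made-up motif instance. The function names (`markov_background`, `plant_motif`), the transition table, and all parameter values are hypothetical and are not taken from the cited studies.

```python
import random

BASES = "ACGT"

# Hypothetical first-order transition probabilities P(next | current),
# listed in the order A, C, G, T; each row sums to 1.
TRANSITIONS = {
    "A": [0.35, 0.15, 0.20, 0.30],
    "C": [0.30, 0.25, 0.10, 0.35],
    "G": [0.25, 0.25, 0.25, 0.25],
    "T": [0.20, 0.20, 0.25, 0.35],
}


def markov_background(length, rng):
    """Generate a random DNA sequence (length >= 1) from the Markov chain."""
    seq = [rng.choice(BASES)]  # first base drawn uniformly
    while len(seq) < length:
        weights = TRANSITIONS[seq[-1]]
        seq.append(rng.choices(BASES, weights=weights)[0])
    return "".join(seq)


def plant_motif(background, instance, rng):
    """Overwrite a randomly chosen window of the background with a motif instance."""
    pos = rng.randrange(len(background) - len(instance) + 1)
    planted = background[:pos] + instance + background[pos + len(instance):]
    return planted, pos


rng = random.Random(0)  # fixed seed so the sketch is reproducible
background = markov_background(500, rng)
sequence, position = plant_motif(background, "TGACTCA", rng)
print(position, sequence[position:position + 7])
```

Because the planted position is recorded, the predictions of a motif discovery program can afterwards be compared against the known location.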
The primary reason for the poor performance is probably related to
the characteristics of the input sets, typically consisting of only a few (on
the order of 10) rather long sequences. The total number of motif
instances hidden in the test sequences was also relatively low, below 20
in most cases. These are unfavorable conditions for motif discovery.
A large number of short sequences, highly enriched in a given motif, would
have made the task of the motif discovery program much easier.
4.2. Why is Protein Domain Discovery Easier?
At first glance, regulatory proteins resemble gene regulatory regions in
that they have a modular architecture. The modules are motifs; in
proteins, they are also called conserved domains. The key difference,
however, lies in the complexity of the modules.
Regulatory DNA elements are short and based on a smaller alphabet.
Complexity is expressed by the information content IC, which is
computed from a base probability matrix as follows [3]:
\[
\mathrm{IC} = \sum_{i=1}^{k} \sum_{b=\mathrm{A}}^{\mathrm{T}} p(i,b)\,\log_2 \frac{p(i,b)}{0.25}
\tag{12}
\]
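As a minimal sketch of how this sum can be evaluated, the following Python snippet computes IC for a small, made-up base probability matrix and converts it into an approximate spacing between chance matches. The function name `information_content`, the dict-per-position matrix layout, and the example values are assumptions made for illustration. Terms with zero probability are skipped, following the usual convention that p log p tends to 0 as p tends to 0.

```python
import math

BASES = "ACGT"


def information_content(matrix, background=0.25):
    """Sum p(i,b) * log2(p(i,b) / background) over all positions and bases.

    `matrix` is a list of per-position dicts mapping base -> probability.
    Terms with p = 0 contribute nothing to the sum.
    """
    ic = 0.0
    for column in matrix:
        for base in BASES:
            p = column[base]
            if p > 0:
                ic += p * math.log2(p / background)
    return ic


# Hypothetical 5-position matrix: four fully conserved positions
# and one position with no base preference.
matrix = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 0.0, "G": 1.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    {"A": 0.0, "C": 0.0, "G": 0.0, "T": 1.0},
]

ic = information_content(matrix)
print(ic)       # 8.0 bits: 2 bits per conserved position, 0 for the uniform one
print(2 ** ic)  # 256.0: approximate spacing in bp between chance matches
```

Each fully conserved position contributes the maximum of 2 bits, while a position matching the uniform background contributes nothing, so this toy matrix would be expected to match random sequence roughly once every 256 bp.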
The definition of information content has the form of a relative
entropy (Kullback-Leibler divergence), where the null model consists of
uniform base probabilities of 0.25 for each base. Here k is the motif
length and p(i,b) is the probability of base b at position i. The
information content, which is expressed in bits, indicates the random
occurrence probability of a motif, which is approximately $2^{-IC}$ per
sequence position. Typical transcription factor binding site matrices in
TRANSFAC have an IC value of about 10 bits, which means that they are
expected to occur by chance about once in every 1000 bp (since
$2^{10} = 1024$). Consequently, about one