Biology Reference
In-Depth Information
A comprehensive comparison study by Tompa and coauthors [45] showed that
the performance of these two broad groups of methods seem to be complemen-
tary in many cases, with a slight performance advantage demonstrated by repre-
sentative methods of the combinatorial class (e.g., Weeder [34]). However, both
categories of motif finding algorithms have their limitations. Many combinatorial
methods enumerate every possible pattern, and are thus limited in the length of
the motifs they can search for. While this may be less of an issue in eukaryotic
genomes, where transcriptional regulation is mediated combinatorially by a large
number of transcription factors with relatively short binding sites, substantially
longer motifs are found when considering either DNA binding sites in prokaryotic
genomes (e.g., for helix-turn-helix binding domains of transcription factors [37])
or protein motifs [4]. Limitations exist in methods based on the PSSM model as
well. These methods typically employ heuristic search techniques that don't guar-
antee convergence to the globally optimal motif (though some recent progress has
been made in this direction [21]). Additionally, a fundamental problem exists
with the PSSM representation. Position-specific scoring matrices assume inde-
pendence between motif positions, and though this assumption leads to satisfac-
tory results in many applications [2], recent work [6, 26] has shown evidence of
dependencies between positions in transcription factor binding sites.
While the positional independence assumption in PSSMs can be relaxed by
using richer models [1, 48], risk of overfitting exists. More importantly, other
analyses [32] find that the distribution of motif instances in many eukaryotic mo-
tifs exhibit complex dependencies that are not easily modeled probabilistically. It
has been shown [33] that modeling these dependencies using the sequence pattern-
based approaches leads to improved performance in representing and searching
for binding sites; a similar statistically significant improvement is not observed
with PSSMs. In fact, Naughton and colleagues [32] report that simulated motifs
generated by PSSM models display a distribution very different from instances
of real biological motifs. They show that biological motifs exhibit a significantly
higher degree of clustering as measured by sequence similarity. As such, it is bet-
ter to evaluate a candidate motif instance while considering a number of its nearest
neighbors, rather than a model of the entire motif.
These observations naturally lead to considering graph-based approaches to
motif finding, in which graph vertices correspond to substrings in the input, edges
denote the degree of similarity between their DNA sequences, and highly con-
nected graph sub-structures exhibit the clustering behavior necessary to identify
motifs. There exist graph-based formulations and algorithms that possess an ad-
vantage over enumerative methods in that they are not exponential in the problem
input size and thus are not limited by the length of the sought motif. Likewise,
Search WWH ::




Custom Search