Graph-based Approaches for Motif Discovery - Clustering Challenges in Biological Network

Biology Reference

In-Depth Information

A comprehensive comparison study by Tompa and coauthors [45] showed that

the performance of these two broad groups of methods seem to be complemen-

tary in many cases, with a slight performance advantage demonstrated by repre-

sentative methods of the combinatorial class (e.g., Weeder [34]). However, both

categories of motif finding algorithms have their limitations. Many combinatorial

methods enumerate every possible pattern, and are thus limited in the length of

the motifs they can search for. While this may be less of an issue in eukaryotic

genomes, where transcriptional regulation is mediated combinatorially by a large

number of transcription factors with relatively short binding sites, substantially

longer motifs are found when considering either DNA binding sites in prokaryotic

genomes (e.g., for helix-turn-helix binding domains of transcription factors [37])

or protein motifs [4]. Limitations exist in methods based on the PSSM model as

well. These methods typically employ heuristic search techniques that don't guar-

antee convergence to the globally optimal motif (though some recent progress has

been made in this direction [21]). Additionally, a fundamental problem exists

with the PSSM representation. Position-specific scoring matrices assume inde-

pendence between motif positions, and though this assumption leads to satisfac-

tory results in many applications [2], recent work [6, 26] has shown evidence of

dependencies between positions in transcription factor binding sites.

While the positional independence assumption in PSSMs can be relaxed by

using richer models [1, 48], risk of overfitting exists. More importantly, other

analyses [32] find that the distribution of motif instances in many eukaryotic mo-

tifs exhibit complex dependencies that are not easily modeled probabilistically. It

has been shown [33] that modeling these dependencies using the sequence pattern-

based approaches leads to improved performance in representing and searching

for binding sites; a similar statistically significant improvement is not observed

with PSSMs. In fact, Naughton and colleagues [32] report that simulated motifs

generated by PSSM models display a distribution very different from instances

of real biological motifs. They show that biological motifs exhibit a significantly

higher degree of clustering as measured by sequence similarity. As such, it is bet-

ter to evaluate a candidate motif instance while considering a number of its nearest

neighbors, rather than a model of the entire motif.

These observations naturally lead to considering graph-based approaches to

motif finding, in which graph vertices correspond to substrings in the input, edges

denote the degree of similarity between their DNA sequences, and highly con-

nected graph sub-structures exhibit the clustering behavior necessary to identify

motifs. There exist graph-based formulations and algorithms that possess an ad-

vantage over enumerative methods in that they are not exponential in the problem

input size and thus are not limited by the length of the sought motif. Likewise,

Search WWH ::

Custom Search

Home