Biology Reference
In-Depth Information
exploiting the graph structure of the problem and looking for highly-weighted
clique-like substructures allows for implicit modeling of complex positional de-
pendencies within motifs, otherwise not possible with PSSMs.
We explore a number of formulations and accompanying algorithms in graph-
based motif discovery that have been proposed and applied to find biologically
interesting motifs. The ability of the graph-based approaches to formulate the
problem in such a way that optimal solutions are attainable while overcoming
many of the limitations of the standard approaches demonstrates the power of
these search procedures. Though we don't focus on experimental results in this
article, it should be noted that the performance of these methods in uncovering
known motifs illustrates their utility for novel sequence motif discovery.
The rest of this review focuses on various methodologies in graph-based al-
gorithms, and does not encompass the broader motif finding problem. Recent
surveys [10, 22, 25] and comparative studies [13, 45] can serve as a good practical
guide when choosing a method to be applied in practice.
4.2. Graph-Theoretic Formulation
The motif finding problem, with the input consisting of a length parameter l of
the putative motif, and N sequences, each of a possibly different length, to be
searched for motif occurrences, can be formulated in graph-theoretic terms [46].
The problem is recast as a complete, weighted N -partite graph G =( V,E ), with
a graph part corresponding to each input sequence. Denoting the set of nodes in
the part corresponding to sequence i by V i , the full vertex set is V = V 1 ∪···∪
V N .
In part V i there is a node for every substring of length l in sequence i . Thus, for a
sequence i of length L there are L
l +1nodes in V i (see Fig. 4.1).
Each pair of nodes u and v in different graph parts is joined by an edge
( u,v )
E . Letting seq(u) denote the input substring corresponding to node u ,
the weight w uv on edge ( u,v ) is a function of the similarity between seq(u) and
seq(v) as well as the background nucleotide distribution. Thus, the weight of an
edge connecting pairs of substrings that are unlikely to appear in the background,
is increased. A motif in this formulation is a selection of vertices, correspond-
ing to substrings collectively exhibiting a higher degree of similarity than would
be expected based on the background distribution. It is also possible to relax the
specification of allowing only one motif instance to occur in every sequence, thus
modeling a biologically more realistic version of the problem. The graph con-
struction is altered to permit edges between vertices in the same graph part.
A related formulation of motif finding that has been recast similarly in graph-
theoretic terms by many methods, is that of 'subtle' motifs, introduced by Pevzner
Search WWH ::




Custom Search