Graph-based Approaches for Motif Discovery - Clustering Challenges in Biological Network

Biology Reference

In-Depth Information

4.3. Linear Programming-based Algorithms

Utilizing a linear programming (LP) framework for motif finding [17, 47] allows

for a very flexible formulation of the problem that is applicable to a number of

variants. This includes the standard motif finding formulation as well as the phy-

logenetic footprinting and subtle motifs problems. Underlying the approach, motif

discovery is cast as the problem of finding the best gapless local multiple sequence

alignment using the sum-of-pairs (SP) scoring scheme, which seeks a l -long sub-

string from every input sequence such that the sum of all their pair-wise simi-

larities is maximized. In the graph G =( V,E ) the equivalent is a selection of

vertices, one from each graph part, such that the weight of the induced clique is

maximized. We refer to this problem as finding the highest weight N -clique.

4.3.1. Edge-Modeling Formulation

For a graph G =( V,E ),where V = V 1 ∪

...

∪

V N and E =

{

( u,v ): u

∈

V i ,v∈ V j ,i = j}

, Zaslavsky and Singh [47] introduce a formulation that explic-

itly models both vertices and edges in the graph. They define a binary decision

variable X u for every vertex u , and a binary decision variable Y uv for every edge

( u,v ). Setting X u to 1 corresponds to selecting vertex u for the N -clique and

thus choosing the sequence position corresponding to u in the alignment. Setting

variable Y uv to 1 corresponds to choosing the edge ( u,v ) for the N -clique.

The following integer linear program (ILP) models the motif finding problem

formulated above:

( u,v ) ∈E w uv ·

Maximize

Y uv

subject to

u∈V j X u =1

for 1

≤

u∈V j Y uv = X v

≤

N,v

∈

V j

for 1

X u ,Y uv ∈{

0 , 1

}

for u

∈

V, ( u,v )

∈

The first set of constraints ensures that exactly one vertex is picked from every

graph part, corresponding to one position being chosen from every input sequence.

The second set of constraints relates vertex variables to edge variables, allowing

the objective function to be expressed in terms of finding a maximum edge-weight

clique. An edge is chosen only if it connects two chosen vertices.

ILP itself is NP-hard, but replacing the integrality constraints on the X and

Y variables with 0

1 allows for a polynomial-time heuristic for

the problem. It is important to note that should a linear programming solution

happen to be integral, it is guaranteed to be optimal for the original ILP and motif

≤

X u ,Y uv ≤

Clustering Challenges in Biological Network

Search WWH ::

Custom Search

Home