solution randomly and then ran the training algorithm on a supercomputer for about
two weeks. The training algorithm terminated when the likelihood of either the
training set or the validation set stopped improving. Note that all the model
parameters are learned from the training set alone, not the validation set; the
validation set is used only to determine when the training algorithm shall
terminate. Training usually terminates after about 3,000 iterations. We also reran
the training algorithm from nine different initial solutions and observed no clear
performance difference among these runs. See our work on EPAD [5] for more details.
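The stopping criterion above can be sketched as a simple early-stopping loop. This is a minimal illustration, not the authors' implementation; the `model` interface (`update`, `log_likelihood`) is hypothetical, and for brevity only the validation likelihood is monitored:

```python
def train_with_early_stopping(model, train_set, valid_set, max_iter=3000):
    """Early-stopping sketch: parameters are fit on the training set only,
    while the held-out validation likelihood merely decides when to stop.
    `model` is a hypothetical object exposing update() and log_likelihood()."""
    best_ll = float("-inf")
    for _ in range(max_iter):
        model.update(train_set)               # one training step on training data
        ll = model.log_likelihood(valid_set)  # monitored, never optimized
        if ll <= best_ll:
            break                             # no improvement: terminate
        best_ll = ll
    return model
```

Because the validation set never contributes a gradient, it gives an unbiased signal of when further training stops generalizing.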
We use two kinds of input features in this neural network model: PSI-BLAST
sequence profile and residue co-evolution. One is the context-specific sequence
profile for a small sequence segment centered at one specific residue in question.
The sequence profile is generated by running PSI-BLAST on the NR database with
5 iterations and an E-value of 0.001. The other feature we used is residue
co-evolution information. Mutual information is a classical method to measure
residue co-evolution strength. However, mutual information cannot differentiate
direct from indirect interactions. For example, when residue a interacts strongly
with residue b and b interacts strongly with residue c, residues a and c are also
likely to show apparent interaction. To reduce the impact of this kind of indirect
signal, global statistical methods such as Graphical Lasso [3] and pseudo-likelihood
[8, 9] methods have been proposed to estimate residue co-evolution strength. However,
these methods are time-consuming. In this work, to account for the chaining effect of
residue coupling, we use the powers of the mutual information matrix. In particular,
let MI denote the mutual information matrix; we use MI^k, where k ranges from 2 to
11, to estimate the chaining effect.
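Computing these matrix powers is straightforward. The sketch below (an illustration under the stated definition, not the authors' code) builds MI^2 through MI^11 for a toy mutual information matrix; the entry (i, j) of MI^k accumulates coupling along chains of k - 1 intermediate residues, which is exactly the chaining effect described above:

```python
import numpy as np

def mi_chain_features(mi, max_power=11):
    """Given a symmetric mutual information matrix `mi` (L x L),
    return the list [MI^2, ..., MI^max_power]. Entry (i, j) of MI^k
    sums products of MI along length-k chains i -> ... -> j, so it
    captures indirect (chained) coupling between residues i and j."""
    powers = []
    current = mi.copy()
    for _ in range(2, max_power + 1):
        current = current @ mi  # MI^k = MI^(k-1) @ MI
        powers.append(current.copy())
    return powers

# Toy 3-residue example: a-b and b-c couple strongly, a-c only weakly.
mi = np.array([[0.0, 0.8, 0.1],
               [0.8, 0.0, 0.9],
               [0.1, 0.9, 0.0]])
feats = mi_chain_features(mi)  # ten matrices, MI^2 .. MI^11
```

In the toy example, the (a, c) entry of MI^2 is much larger than the raw MI(a, c), reflecting the indirect a-b-c chain.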
2.3 Scoring Similarity of Two Markov Random Fields
This section introduces how to align two proteins by aligning their corresponding
MRFs. As shown in the left picture of Fig. 2.3, building an alignment is
equivalent to finding a unique path from the top-left corner to the bottom-right
corner. For each vertex along the path, we need a score to measure how good it is to
transit to the next vertex. That is, we need to measure how similar two nodes of the
two MRFs are. We call this kind of scoring function the node alignment potential.
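When only the node alignment potential is used, the optimal path through the alignment matrix can be found with standard dynamic programming. The sketch below illustrates this with a Needleman-Wunsch-style recurrence; `node_score` is a hypothetical node-similarity function standing in for the node alignment potential:

```python
import numpy as np

def align_by_node_potential(n1, n2, node_score, gap=-1.0):
    """Dynamic-programming sketch of aligning two MRFs with n1 and n2
    nodes using only a node alignment potential. Each diagonal move
    (i-1, j-1) -> (i, j) is scored by node_score(i-1, j-1), a
    hypothetical similarity between node i-1 of MRF 1 and node j-1 of
    MRF 2; horizontal/vertical moves incur a gap penalty. Returns the
    optimal alignment score at the bottom-right corner."""
    dp = np.zeros((n1 + 1, n2 + 1))
    dp[:, 0] = gap * np.arange(n1 + 1)
    dp[0, :] = gap * np.arange(n2 + 1)
    for i in range(1, n1 + 1):
        for j in range(1, n2 + 1):
            dp[i, j] = max(dp[i - 1, j - 1] + node_score(i - 1, j - 1),  # match
                           dp[i - 1, j] + gap,                           # gap in MRF 2
                           dp[i, j - 1] + gap)                           # gap in MRF 1
    return dp[n1, n2]
```

This works because the node potential decomposes over single vertices of the path; the edge alignment potential introduced next breaks that decomposition.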
Second, in addition to measuring the similarity of two aligned MRF nodes, we want
to quantify the similarity between two MRF edges. For example, in the right picture
of Fig. 2.3, residues L and S of the first protein are aligned to residues A and Q
of the 2nd protein, respectively. We would like to estimate how good it is to
align the pair (L, S) to the pair (A, Q). This pairwise similarity function is a
function of two MRF edges and we call it the edge alignment potential. When the
edge alignment potential is used to score the similarity of two MRFs, the Viterbi
algorithm or simple dynamic programming cannot be used to find the optimal
alignment. It can