Method - Protein Homology Detection Through Alignment of Markov Random Fields

such as PSICOV, Evfold, and plmDCA [2–4] as residue co-evolution. PSICOV assumes that $P(X)$ is a Gaussian distribution and calculates the correlation between two columns from the inverse covariance matrix. By contrast, plmDCA does not assume a Gaussian distribution and is more efficient and also slightly more accurate. Generally speaking, these programs are time-consuming.

The reliability of mutual information (MI) or direct information (DI) [2] depends on the number of non-redundant sequence homologs. When there are few sequence homologs, the resulting MI or DI is not very accurate. Therefore, residue co-evolution strength alone is not enough to estimate residue interaction strength. We can use other contact prediction programs such as PhyCMAP [4], which integrates residue co-evolution information, the PSI-BLAST sequence profile, and other features to predict the probability of two residues being in contact. PhyCMAP works much better than PSICOV and Evfold when the proteins under study have a small number of sequence homologs [4].
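
As a rough illustration of why MI becomes unreliable with few homologs, the sketch below computes the MI between two alignment columns from empirical frequencies. The function name `column_mutual_information` and the toy columns are made up for illustration; real pipelines typically add pseudocounts, sequence weighting, and gap handling, none of which is attempted here.

```python
import math
from collections import Counter


def column_mutual_information(col_i, col_j):
    """Mutual information (in nats) between two equally long alignment columns."""
    if len(col_i) != len(col_j):
        raise ValueError("columns must contain the same number of sequences")
    n = len(col_i)
    count_i = Counter(col_i)               # residue counts in column i
    count_j = Counter(col_j)               # residue counts in column j
    count_ij = Counter(zip(col_i, col_j))  # joint residue-pair counts
    mi = 0.0
    for (a, b), n_ab in count_ij.items():
        # p_ab * log(p_ab / (p_a * p_b)), written with raw counts
        mi += (n_ab / n) * math.log(n_ab * n / (count_i[a] * count_j[b]))
    return mi


# Eight homologs: one column pair co-varies perfectly, the other is independent.
covarying = column_mutual_information("AAAALLLL", "DDDDKKKK")
independent = column_mutual_information("AALLAALL", "DKDKDKDK")

# Only two homologs: the estimate is as high as in the perfectly coupled case.
two_sequences = column_mutual_information("AL", "DK")

print(covarying, independent, two_sequences)
```

With only two sequences, any two variable columns appear perfectly coupled, which is the small-sample effect described above.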

In this work, we use predicted inter-residue Euclidean distance to reflect the interaction strength of two residues. This is based upon the assumption that two spatially close residues tend to have strong interaction. We predict the inter-residue distance using sequence information such as mutual information and its power series, the PSI-BLAST sequence profile, and other protein features. See [5] for more details. Below we briefly describe how to predict inter-residue distance from sequence information using probabilistic neural networks (PNN).

We discretize the $C_\alpha$–$C_\alpha$ distance into 13 bins (3–4, 4–5, 5–6, …, 14–15, and >15 Å). Each bin is also called a label. Given a protein and a pair of residues $i$ and $j$, let $d_k$ denote the bin into which their distance falls, and $x_k$ denote the protein feature vector consisting of position-specific sequence profile information and also the mutual information between the two positions. We would like to estimate the probability of observing $d_k$ given the feature vector $x_k$. That is, instead of only considering the most likely distance label assigned to each pair of nodes (residues), we would like to estimate the probability distribution of $d_k$. The reason is that the predicted distance probability distribution is more informative than a single predicted value.
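
The binning, and the benefit of keeping a full distribution, can be sketched as follows. This is a minimal illustration only: the handling of distances below 3 Å and of values falling exactly on a bin edge is an assumption, and the probabilities in `predicted` are invented numbers, not output of the actual predictor.

```python
import math

NUM_BINS = 13       # 3-4, 4-5, ..., 14-15 (twelve 1-angstrom bins) plus >15
LOWER_EDGE = 3.0    # angstroms
UPPER_EDGE = 15.0   # angstroms


def distance_to_label(distance):
    """Map a C-alpha/C-alpha distance in angstroms to a label in 0..12."""
    if distance >= UPPER_EDGE:
        # Assumption: a distance of exactly 15 goes to the last (>15) bin.
        return NUM_BINS - 1
    # Assumption: distances below 3 angstroms are clamped into the first bin.
    return max(0, int(math.floor(distance - LOWER_EDGE)))


# Hypothetical predicted distribution over the 13 labels for one residue pair.
predicted = [0.00, 0.01, 0.04, 0.15, 0.20, 0.18, 0.12,
             0.08, 0.06, 0.05, 0.04, 0.03, 0.04]

# A single predicted label keeps only the mode of the distribution ...
most_likely = max(range(NUM_BINS), key=lambda label: predicted[label])

# ... while the full distribution also answers questions such as
# "how likely is it that the two residues are closer than 8 angstroms?"
p_within_8 = sum(predicted[: distance_to_label(7.5) + 1])

print(most_likely, round(p_within_8, 2))
```

Here the most likely label is the 7–8 Å bin, yet the distribution assigns only about 0.40 probability to the pair being within 8 Å, information that a single predicted value would hide.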

Formally, let $p_\theta(d_k \mid x_k)$ be the probability of the distance label $d_k$ conditioned on the feature vector $x_k$. Meanwhile, $\theta$ is the model parameter vector. We estimate $p_\theta(d_k \mid x_k)$ as follows:

\[
p_\theta(d_k \mid x_k) = \frac{\exp\left(L_\theta(d_k, x_k)\right)}{Z_\theta(x_k)}. \tag{2.2}
\]
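
Equation (2.2) is a softmax over the scores of the 13 distance labels, with the normalizer taken as the sum of the exponentiated scores over all labels. A minimal sketch, assuming the scores $L_\theta(d, x_k)$ have already been computed (the values in `scores` below are hypothetical, and the network that produces them is not implemented here):

```python
import math


def label_distribution(scores):
    """Turn per-label scores L(d, x) into probabilities exp(L(d, x)) / Z(x)."""
    # Subtracting the maximum score leaves the ratio unchanged
    # but avoids overflow in exp for large scores.
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    z = sum(exps)  # partition function Z(x), up to the common factor exp(shift)
    return [e / z for e in exps]


# Hypothetical scores for the 13 distance labels of one residue pair.
scores = [-2.0, -1.0, 0.5, 1.5, 2.0, 1.8, 1.2, 0.7, 0.3, 0.0, -0.3, -0.6, -0.2]
probs = label_distribution(scores)

print(sum(probs))                # close to 1.0, up to floating-point rounding
print(probs.index(max(probs)))   # label with the highest score
```

Because the partition function depends only on $x_k$, adding a constant to every score leaves the resulting distribution unchanged.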

Here $Z_\theta(x_k) = \sum_{d} \exp\left(L_\theta(d, x_k)\right)$ is the partition function and $L_\theta(d, x_k)$ is a two-layer neural network. Figure 2.2 shows an example of the neural network with three and five neurons in the first and second hidden layers, respectively. Each neuron is a sigmoid function. The function $L_\theta(d_k, x_k)$ can be calculated as,
