Introduction - Protein Homology Detection Through Alignment of Markov Random Fields


PSI-BLAST represents the sequence profile as a position-specific frequency
matrix (PSFM) or position-specific scoring matrix (PSSM), which is widely used
in many applications such as homology detection, fold recognition and protein
structure prediction [45, 49, 50]. Both PSFM and PSSM have dimension 20 × N,
where N is the protein sequence length. Each column in a PSFM contains the
occurrence frequencies of the 20 amino acids at the corresponding sequence
position. Accordingly, each column in a PSSM contains the potential of mutating
to each of the 20 amino acids at the corresponding position. A good sequence
profile should include as much of the information in the MSA as possible. In
addition to representation, the quality of a sequence profile depends on the
following factors: the number of PSI-BLAST iterations, the E-value cutoff used
to determine whether two proteins are homologous, and the sequence weighting
scheme [39]. It also depends on how amino acid pseudo-counts are included when
converting amino acid occurrence frequencies to mutation potentials.
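
The relationship between an MSA, a PSFM and a PSSM can be illustrated with a
minimal sketch. It is not PSI-BLAST's actual procedure: it assumes a uniform
background distribution, a simple additive pseudo-count, and no sequence
weighting, and the function names `build_psfm` and `build_pssm` are
hypothetical. The mutation potential is taken here to be a base-2 log-odds
score, which is one common choice.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids


def build_psfm(msa, pseudocount=1.0):
    """Return N columns, each a dict mapping amino acid -> frequency.

    Assumption: every sequence in `msa` has the same aligned length, and
    gap or unknown characters are simply ignored.
    """
    n_cols = len(msa[0])
    psfm = []
    for j in range(n_cols):
        # Start every amino acid at the pseudo-count so no frequency is zero.
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in msa:
            if seq[j] in counts:
                counts[seq[j]] += 1.0
        total = sum(counts.values())
        psfm.append({aa: c / total for aa, c in counts.items()})
    return psfm


def build_pssm(psfm, background=None):
    """Convert frequencies to log-odds scores against a background."""
    if background is None:
        # Assumption: uniform background; real tools use empirical frequencies.
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return [
        {aa: math.log2(col[aa] / background[aa]) for aa in AMINO_ACIDS}
        for col in psfm
    ]


toy_msa = ["ACD", "ACE", "A-D"]  # toy alignment, three sequences, N = 3
psfm = build_psfm(toy_msa)
pssm = build_pssm(psfm)
```

In this toy example the first column is fully conserved, so its score for A is
positive while every other amino acid in that column scores below zero. A
larger pseudo-count pulls all scores toward zero, which is why the pseudo-count
scheme affects profile quality.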

Profile Hidden Markov Model (HMM) [51] is another way to model an MSA of
protein homologs. Profile HMM is better than PSFM/PSSM in that the former takes
into consideration correlations between adjacent residues and also explicitly
models gaps, so profile HMM on average is more sensitive than PSSM/PSFM for
protein alignment and remote homology detection [40, 44]. In particular, a
profile HMM usually contains three types of states: match, insert and delete.
A 'match' state at an MSA column models the probability of residues being
allowed in the column. It also contains the emission probability of each amino
acid type at this column. An 'insert' or 'delete' state at an MSA column allows
for insertion of residues between that column and the next, or for deletion of
residues. That is, a profile HMM has a position-dependent gap penalty. The
penalty for an insertion or deletion depends on the HMM model parameters at
each position. By contrast, the traditional sequence alignment model uses a
position-independent gap penalty. An insertion or deletion of x residues is
typically scored with an affine gap penalty, say a + b(x − 1), where a is the
penalty for a gap opening and b for an extended gap.
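
The contrast between the two gap models can be sketched as follows. The first
function is the affine penalty a + b(x − 1) from the text. The second is a
simplified illustration of a position-dependent penalty, assuming the cost of
an insertion is the negative log-probability of entering the insert state once
and then self-looping x − 1 times. It ignores insert-state emissions and the
transition back to a match state, and the function names and probability
values are hypothetical.

```python
import math


def affine_gap_penalty(x, a, b):
    """Position-independent penalty for a gap of x residues: a + b(x - 1)."""
    if x < 1:
        raise ValueError("gap length must be at least 1")
    return a + b * (x - 1)


def hmm_insertion_penalty(x, p_match_to_insert, p_insert_to_insert):
    """Simplified position-dependent penalty for inserting x residues.

    Both transition probabilities belong to one model position, so the
    same gap length can cost more or less depending on where it occurs.
    """
    if x < 1:
        raise ValueError("insertion length must be at least 1")
    return -math.log(p_match_to_insert) - (x - 1) * math.log(p_insert_to_insert)


# The affine penalty is the same everywhere in the alignment.
flat = affine_gap_penalty(4, a=10.0, b=1.0)  # 10 + 1 * 3 = 13.0

# Hypothetical transition probabilities at two model positions:
# a conserved core column versus a loop region where insertions are common.
core = hmm_insertion_penalty(3, p_match_to_insert=0.01, p_insert_to_insert=0.5)
loop = hmm_insertion_penalty(3, p_match_to_insert=0.20, p_insert_to_insert=0.5)
```

Under these assumed values the insertion costs more at the conserved position
than in the loop, whereas the affine penalty cannot make that distinction. The
HMM form is itself affine in x, with the opening term −log of the
match-to-insert probability and the extension term −log of the
insert-to-insert probability, except that both terms vary by position.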

A few popular homology detection programs such as HHpred [44] and HMMER [40]
use profile HMMs for remote homology detection. Pfam [52, 53] and SUPERFAMILY
[54] are two large publicly available libraries of profile HMMs of common
protein domains. In many applications, profile HMM has demonstrated better
performance than the PSI-BLAST sequence profile (i.e., PSFM/PSSM) in terms of
alignment accuracy and homology detection success rate [6, 44]. However, both
profile HMM and PSFM/PSSM are restricted in that they cannot model long-range
residue correlation in an MSA [55].

1.4.3 Scoring Function for Profile-Based Alignment and Homology Detection

A key component of a profile-based homology detection method is the scoring
function, which measures the similarity of one sequence and one sequence
profile or that of two sequence profiles. Unlike protein sequence alignment
that can use an
