PSI-BLAST represents the sequence profile as a position-specific frequency matrix
(PSFM) or position-specific scoring matrix (PSSM), which is widely used in many
applications such as homology detection, fold recognition and protein structure
prediction [45, 49, 50]. Both PSFM and PSSM have dimension 20 × N, where N is
the protein sequence length. Each column in a PSFM contains the occurrence
frequencies of the 20 amino acids at the corresponding sequence position.
Accordingly, each column in a PSSM contains the potential of mutating to each of
the 20 amino acids at the corresponding position. A good sequence profile should
include as much of the information in the MSA as possible. In addition to
representation, the quality of a sequence profile depends on the following
factors: the number of PSI-BLAST iterations, the E-value cutoff used to determine
whether two proteins are homologous, and the sequence weighting scheme [39]. It
also depends on how amino acid pseudo-counts are included when converting amino
acid occurrence frequencies to mutation potentials.
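
The relationship between an MSA, a PSFM and a PSSM can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions that are not taken from the text: every sequence gets equal weight, a flat add-one pseudo-count is used, and the background distribution is uniform. PSI-BLAST itself uses a more elaborate weighting and pseudo-count scheme, and the function names `build_psfm` and `build_pssm` are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_psfm(msa, pseudocount=1.0):
    """Return a 20 x N PSFM as a dict: amino acid -> list of N frequencies.

    Assumptions (illustrative only): equal sequence weights and a flat
    pseudo-count added to every amino acid in every column.
    """
    n_cols = len(msa[0])
    psfm = {aa: [0.0] * n_cols for aa in AMINO_ACIDS}
    for col in range(n_cols):
        counts = {aa: pseudocount for aa in AMINO_ACIDS}
        for seq in msa:
            residue = seq[col]
            if residue in counts:  # gaps and unknown residues are skipped
                counts[residue] += 1.0
        total = sum(counts.values())
        for aa in AMINO_ACIDS:
            psfm[aa][col] = counts[aa] / total
    return psfm


def build_pssm(psfm, background=None):
    """Convert frequencies to log2-odds scores against a background.

    Assumption: a uniform background (1/20) unless one is supplied.
    """
    if background is None:
        background = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}
    return {
        aa: [math.log2(freq / background[aa]) for freq in freqs]
        for aa, freqs in psfm.items()
    }


# Toy MSA with three aligned sequences of length N = 3.
toy_msa = ["ACD", "ACE", "A-D"]
toy_psfm = build_psfm(toy_msa)
toy_pssm = build_pssm(toy_psfm)
```

The pseudo-count keeps every frequency strictly positive, so the logarithm in `build_pssm` is always defined; changing its value changes the resulting mutation potentials, which is the dependence on pseudo-counts noted above.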
Profile Hidden Markov Model (HMM) [51] is another way to model an MSA of
protein homologs. A profile HMM is better than PSFM/PSSM in that it takes
into consideration correlations between adjacent residues and also explicitly
models gaps, so a profile HMM is on average more sensitive than PSSM/PSFM for
protein alignment and remote homology detection [40, 44]. In particular, a
profile HMM usually contains three states: match, insert and delete. A 'match'
state at an MSA column models the probability of residues being allowed in the
column. It also contains the emission probability of each amino acid type at this
column. An 'insert' or 'delete' state at an MSA column allows for insertion of
residues between that column and the next, or for deletion of residues. That is,
a profile HMM has a position-dependent gap penalty. The penalty for an insertion
or deletion depends on the HMM model parameters at each position. By contrast,
traditional sequence alignment models use a position-independent gap penalty. An
insertion or deletion of x residues is typically scored with an affine gap
penalty, say a + b(x − 1), where a is the penalty for a gap opening and b for an
extended gap.
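
The contrast between the two gap models can be made concrete with a small sketch. The first function is the affine penalty a + b(x − 1) from the text. The second is a simplified, hypothetical view of a profile HMM insertion: its cost is taken as the negative log of one gap-opening transition and x − 1 gap-extension transitions at a given position, ignoring emission probabilities and the transition that closes the gap. The function names and the probability values are assumptions made for illustration.

```python
import math


def affine_gap_penalty(x, gap_open, gap_extend):
    """Position-independent affine penalty a + b(x - 1) for a gap of x residues."""
    if x < 1:
        raise ValueError("gap length x must be at least 1")
    return gap_open + gap_extend * (x - 1)


def position_specific_gap_penalty(x, position, open_probs, extend_probs):
    """Simplified profile-HMM-style penalty for an x-residue insertion.

    open_probs[position]   : assumed probability of opening a gap here
    extend_probs[position] : assumed probability of extending that gap
    The result is again affine in x, but a and b now vary with position.
    """
    if x < 1:
        raise ValueError("gap length x must be at least 1")
    a = -math.log(open_probs[position])
    b = -math.log(extend_probs[position])
    return a + b * (x - 1)


# Hypothetical parameters: position 0 is conserved (gaps are unlikely),
# position 1 lies in a loop region (gaps are common).
open_probs = [0.01, 0.20]
extend_probs = [0.50, 0.50]

conserved_cost = position_specific_gap_penalty(3, 0, open_probs, extend_probs)
loop_cost = position_specific_gap_penalty(3, 1, open_probs, extend_probs)
flat_cost = affine_gap_penalty(3, gap_open=10, gap_extend=1)
```

Under these assumed parameters the same three-residue gap costs more at the conserved position than in the loop region, whereas the affine penalty assigns it the same cost everywhere.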
A few popular homology detection programs such as HHpred [44] and HMMER
[40] use profile HMMs for remote homology detection. Pfam [52, 53] and
SUPERFAMILY [54] are two large publicly available libraries of profile HMMs of
common protein domains. In many applications, profile HMM has demonstrated
better performance than PSI-BLAST sequence profile (i.e., PSFM/PSSM) in terms
of alignment accuracy and homology detection success rate [6, 44]. However, both
profile HMM and PSFM/PSSM are restricted in that they cannot model long-range
residue correlation in an MSA [55].
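
A toy example may help show what is lost when columns are modeled independently. The two made-up alignments below have identical per-column frequencies, so they would yield the same PSFM, yet only the first has a perfect coupling between its first and last columns. The sketch covers the PSFM case only; the helper names and the sequences are hypothetical.

```python
from collections import Counter


def column_frequencies(msa):
    """Per-column residue frequencies (what a PSFM stores, without pseudo-counts)."""
    n_rows = len(msa)
    return [
        {res: count / n_rows
         for res, count in Counter(seq[col] for seq in msa).items()}
        for col in range(len(msa[0]))
    ]


def pair_frequencies(msa, i, j):
    """Joint frequencies of the residue pair found in columns i and j."""
    n_rows = len(msa)
    pairs = Counter((seq[i], seq[j]) for seq in msa)
    return {pair: count / n_rows for pair, count in pairs.items()}


# Columns 0 and 3 co-vary perfectly in the first alignment only.
correlated = ["AGGK", "AGGK", "DGGE", "DGGE"]
uncorrelated = ["AGGK", "AGGE", "DGGK", "DGGE"]

same_profile = column_frequencies(correlated) == column_frequencies(uncorrelated)
same_pairs = pair_frequencies(correlated, 0, 3) == pair_frequencies(uncorrelated, 0, 3)
```

Here `same_profile` is true while `same_pairs` is false: the column-wise profile cannot distinguish the two alignments, although their pairwise statistics differ.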
1.4.3 Scoring Function for Profile-Based Alignment and Homology Detection
A key component of a profile-based homology detection method is the scoring
function, which measures the similarity of one sequence and one sequence profile
or that of two sequence profiles. Unlike protein sequence alignment that can use an