4.2 Test Data
The data used to test alignment accuracy has no fold-level overlap with the training
and validation data. In particular, we use the following three datasets to test the
alignment accuracy, which are subsets of the test data used in [ 4 ] to benchmark
protein modeling methods.
• Set3.6K: a set of 3,617 non-redundant protein pairs. Two proteins in a pair share <40 % sequence identity and have a small length difference. By "non-redundant" we mean that in any two protein pairs, there are at least two proteins (one from each pair) sharing less than 25 % sequence identity.
• Set2.6K: a set of 2,633 non-redundant protein pairs. Two proteins in a pair share <25 % sequence identity and have a length difference larger than 30 %. This set is mainly used to test how well an alignment method handles domain boundaries.
• Set60K: a very large set of 60,929 protein pairs, in most of which the two proteins share less than 40 % sequence identity. Of these, 846, 40,902, and 19,181 pairs are similar at the SCOP family, superfamily, and fold level, respectively, and 151, 2,691, and 2,218 of those pairs, respectively, consist only of mainly-beta proteins.
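The non-redundancy criterion used for these pair sets can be sketched in code. The helper below is purely illustrative: `identity` stands in for a hypothetical function returning the pairwise sequence identity of two proteins as a fraction, and is not part of any dataset tooling described here.

```python
from itertools import combinations

def is_nonredundant(pairs, identity, cutoff=0.25):
    """Check the Set3.6K-style non-redundancy criterion: for any two
    protein pairs, at least two proteins (one from each pair) must
    share less than `cutoff` sequence identity.

    `pairs` is a list of (protein, protein) tuples; `identity(x, y)`
    is a hypothetical callable returning sequence identity in [0, 1].
    """
    for (a1, a2), (b1, b2) in combinations(pairs, 2):
        # The criterion holds for this pair of pairs if at least one
        # cross-pair couple falls below the identity cutoff.
        if not any(identity(x, y) < cutoff
                   for x in (a1, a2) for y in (b1, b2)):
            return False
    return True
```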
We use three benchmarks, SCOP20, SCOP40 and SCOP80, to test the success rate of remote homology detection and fold recognition. These benchmarks were used by the Söding group to study the context-specific mutation score [4]. They are constructed by filtering the SCOP database with a maximum sequence identity of 20, 40 and 80 %, respectively. In total they contain 4,884, 7,088, and 9,867 proteins, respectively, of which 1,281, 1,806, and 2,734, respectively, are mainly-beta proteins.
For a protein in the first three datasets, we run PSI-BLAST with 5 iterations to find close sequence homologs and then build a multiple sequence alignment (MSA). The MSA files for the three SCOP benchmarks are downloaded from the HHpred website (ftp://toolkit.genzentrum.lmu.de/pub/). Pseudo-counts are employed, especially for proteins with very few close sequence homologs.
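The role of pseudo-counts can be illustrated with a minimal sketch: observed residue counts in a profile column are mixed with background amino-acid frequencies, so that columns backed by very few homologs are not dominated by zero counts. This simple additive mixing is an illustrative stand-in only; profile tools such as HHpred use more elaborate substitution-matrix-based pseudo-counts.

```python
def add_pseudocounts(counts, background, alpha=1.0):
    """Mix observed residue counts for one profile column with
    background frequencies; `alpha` controls pseudocount strength.

    `counts` maps residue -> observed count in the column;
    `background` maps residue -> background frequency (sums to 1).
    Returns a smoothed probability distribution over residues.
    """
    n = sum(counts.values())
    return {aa: (counts.get(aa, 0) + alpha * background[aa]) / (n + alpha)
            for aa in background}
```

With few observations (small `n`), the background term dominates; as homologs accumulate, the observed frequencies take over.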
Programs to compare To evaluate alignment accuracy, we compare our method, denoted as MRFalign, with the sequence-HMM alignment method HMMER [5] and the HMM-HMM alignment method HHalign [6]. HMMER is run with its default E-value threshold (10.0). HHalign is run with the option -mact 0.1. To evaluate the performance of remote homology detection, we compare MRFalign with the PSSM-PSSM based method FFAS [7], the sequence-HMM based method hmmscan, and the HMM-HMM based methods HHsearch and HHblits [8]. HHsearch and hmmscan use HHalign and HMMER, respectively, for protein alignment.
Evaluation criteria Three performance metrics are employed: reference-dependent alignment recall, reference-dependent alignment precision, and the success rate of homology detection and fold recognition. Alignment recall is the fraction of align-able residues in a reference alignment that are correctly aligned.
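Reference-dependent recall and precision can be sketched as set operations over aligned residue pairs. This is an illustrative computation, assuming each alignment is represented as a set of aligned residue-index pairs (i, j) between the two proteins:

```python
def alignment_recall_precision(predicted, reference):
    """Compute reference-dependent alignment recall and precision.

    `predicted` and `reference` are iterables of (i, j) residue-index
    pairs; recall is measured against the reference alignment,
    precision against the predicted one.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision
```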