4.2 Test Data
The data used to test alignment accuracy has no fold-level overlap with the training
and validation data. In particular, we use the following three datasets to test the
alignment accuracy, which are subsets of the test data used in [ 4 ] to benchmark
protein modeling methods.
• Set3.6K: a set of 3,617 non-redundant protein pairs. Two proteins in a pair share <40 % sequence identity and have a small length difference. By "non-redundant" we mean that in any two protein pairs, there are at least two proteins (one from each pair) sharing less than 25 % sequence identity.
• Set2.6K: a set of 2,633 non-redundant protein pairs. Two proteins in a pair share <25 % sequence identity and have a length difference larger than 30 %. This set is mainly used to test how well an alignment method handles domain boundaries.
• Set60K: a very large set of 60,929 protein pairs, in most of which the two proteins share less than 40 % sequence identity. Of these, 846, 40,902, and 19,181 pairs are similar at the SCOP family, superfamily, and fold level, respectively, and 151, 2,691, and 2,218 of those pairs, respectively, consist only of mainly-beta proteins.
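The non-redundancy criterion used for these pair sets can be sketched in code. The helper below is purely illustrative: `identity` stands in for a hypothetical function returning the pairwise sequence identity of two proteins as a fraction, and is not part of any dataset tooling described here.

```python
from itertools import combinations

def is_nonredundant(pairs, identity, cutoff=0.25):
    """Check the Set3.6K-style non-redundancy criterion: for any two
    protein pairs, at least two proteins (one from each pair) must
    share less than `cutoff` sequence identity.

    `pairs` is a list of (protein, protein) tuples; `identity(x, y)`
    is a hypothetical callable returning sequence identity in [0, 1].
    """
    for (a1, a2), (b1, b2) in combinations(pairs, 2):
        # The criterion holds for this pair of pairs if at least one
        # cross-pair couple falls below the identity cutoff.
        if not any(identity(x, y) < cutoff
                   for x in (a1, a2) for y in (b1, b2)):
            return False
    return True
```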
We use three benchmarks, SCOP20, SCOP40 and SCOP80, to test the success rate of remote homology detection and fold recognition. These benchmarks were used by the Söding group to study the context-specific mutation score [4]. They are constructed by filtering the SCOP database with a maximum sequence identity of 20, 40 and 80 %, respectively. In total they contain 4,884, 7,088, and 9,867 proteins, respectively, of which 1,281, 1,806, and 2,734, respectively, are mainly-beta proteins.
For a protein in the first three datasets, we run PSI-BLAST with 5 iterations to find close sequence homologs and then build a multiple sequence alignment (MSA). The MSA files for the three SCOP benchmarks are downloaded from the HHpred website (ftp://toolkit.genzentrum.lmu.de/pub/). Pseudo-counts are employed, especially for proteins with very few close sequence homologs.
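The role of pseudo-counts can be illustrated with a minimal sketch: observed residue counts in a profile column are mixed with background amino-acid frequencies, so that columns backed by very few homologs are not dominated by zero counts. This simple additive mixing is an illustrative stand-in only; profile tools such as HHpred use more elaborate substitution-matrix-based pseudo-counts.

```python
def add_pseudocounts(counts, background, alpha=1.0):
    """Mix observed residue counts for one profile column with
    background frequencies; `alpha` controls pseudocount strength.

    `counts` maps residue -> observed count in the column;
    `background` maps residue -> background frequency (sums to 1).
    Returns a smoothed probability distribution over residues.
    """
    n = sum(counts.values())
    return {aa: (counts.get(aa, 0) + alpha * background[aa]) / (n + alpha)
            for aa in background}
```

With few observations (small `n`), the background term dominates; as homologs accumulate, the observed frequencies take over.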
Programs to compare To evaluate alignment accuracy, we compare our method, denoted as MRFalign, with the sequence-HMM alignment method HMMER [5] and the HMM-HMM alignment method HHalign [6]. HMMER is run with its default E-value threshold (10.0). HHalign is run with the option -mact 0.1. To evaluate the performance of remote homology detection, we compare MRFalign with the PSSM-PSSM based method FFAS [7], the sequence-HMM based method hmmscan, and the HMM-HMM based methods HHsearch and HHblits [8]. HHsearch and hmmscan use HHalign and HMMER, respectively, for protein alignment.
Evaluation criteria Three performance metrics are employed: reference-dependent alignment recall, reference-dependent alignment precision, and the success rate of homology detection and fold recognition. Alignment recall is the fraction of align-able residues in a reference alignment that are correctly aligned.
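Reference-dependent recall and precision can be sketched as set operations over aligned residue pairs. This is an illustrative computation, assuming each alignment is represented as a set of aligned residue-index pairs (i, j) between the two proteins:

```python
def alignment_recall_precision(predicted, reference):
    """Compute reference-dependent alignment recall and precision.

    `predicted` and `reference` are iterables of (i, j) residue-index
    pairs; recall is measured against the reference alignment,
    precision against the predicted one.
    """
    predicted, reference = set(predicted), set(reference)
    correct = len(predicted & reference)
    recall = correct / len(reference) if reference else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision
```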