Software - Protein Homology Detection Through Alignment of Markov Random Fields


Column 6

'

Edge

'

The accumulative edge alignment potential.

Column 7

'

qRange

'

Range of aligned region in the query protein.

Column 8

'

tRange

'

Range of aligned region in the subject protein.

Column 9

'

tLength

'

The length of the subject protein.

Column 10

'

Cols

'

The number of aligned positions in the pairwise alignment.

Column 11

'

#tGaps

'

The number of gaps in the subject protein (Insertions).

Column 12

'

#qGaps

'

The number of gaps in the query protein (Deletions).

Column 13

'

#seqID

'

The number of identical residues in the alignment.
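As a rough illustration of the column layout above, the following Python sketch reads columns 6 to 13 from one line of a ranking file. It assumes the columns are whitespace-separated and that each line has at least 13 fields; both are assumptions, since the file format is not specified here. The class name, the function name, and the sample line are hypothetical, and the first five columns are treated as opaque because they are not described in this part of the text.

```python
from dataclasses import dataclass


@dataclass
class AlignmentColumns:
    """Columns 6-13 of one ranking-file line (names follow the column list)."""
    edge: float      # Column 6, 'Edge'
    q_range: str     # Column 7, 'qRange' (kept as raw text; format not assumed)
    t_range: str     # Column 8, 'tRange' (kept as raw text; format not assumed)
    t_length: int    # Column 9, 'tLength'
    cols: int        # Column 10, 'Cols'
    t_gaps: int      # Column 11, '#tGaps'
    q_gaps: int      # Column 12, '#qGaps'
    seq_id: int      # Column 13, '#seqID'


def parse_columns_6_to_13(line: str) -> AlignmentColumns:
    """Parse columns 6-13, ASSUMING whitespace-separated fields."""
    fields = line.split()
    if len(fields) < 13:
        raise ValueError(f"expected at least 13 fields, got {len(fields)}")
    edge, q_range, t_range, t_length, cols, t_gaps, q_gaps, seq_id = fields[5:13]
    return AlignmentColumns(
        edge=float(edge),
        q_range=q_range,
        t_range=t_range,
        t_length=int(t_length),
        cols=int(cols),
        t_gaps=int(t_gaps),
        q_gaps=int(q_gaps),
        seq_id=int(seq_id),
    )


# Made-up sample line: the first five fields are placeholders, not real output.
sample = "c1 c2 c3 c4 c5 -12.5 1-100 3-110 120 98 2 8 31"
record = parse_columns_6_to_13(sample)
# Derived quantity: fraction of aligned positions that are identical residues.
identity_fraction = record.seq_id / record.cols
```

Keeping 'qRange' and 'tRange' as raw strings avoids guessing how the ranges are written in the actual file.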

3.5 Interpreting P-Value

In the ranking file, the P-value can be interpreted as a confidence score indicating the relative quality of the top-ranked proteins and their corresponding alignments. To calculate the P-value, we employ a set of reference proteins (in “databases/CAL_TGT”), which consists of 1,800 single-domain proteins belonging to different SCOP folds. Given a query protein, we first align it to this reference protein database and then estimate an extreme value distribution from the 1,800 alignment scores. Based upon this distribution, we calculate the P-value of each alignment when aligning the query protein to the subject protein database. The P-value thus measures how likely each subject protein is to be homologous to the query protein, judged by comparison with the reference proteins; a smaller P-value indicates a more confident hit.
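The procedure above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the method used by the software: the text does not say which extreme value distribution or fitting procedure is used, so the sketch assumes a Gumbel distribution fitted by the method of moments, and it assumes that higher alignment scores are better. The function names and the reference scores are hypothetical.

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def fit_gumbel_moments(scores):
    """Fit a Gumbel (type I extreme value) distribution by matching moments.

    ASSUMPTION: the fitting method is a stand-in; the text only says that an
    extreme value distribution is estimated from the reference scores.
    """
    mean = statistics.fmean(scores)
    std = statistics.stdev(scores)
    beta = std * math.sqrt(6) / math.pi   # scale parameter
    mu = mean - EULER_GAMMA * beta        # location parameter
    return mu, beta


def gumbel_p_value(score, mu, beta):
    """Probability that a reference alignment scores at least `score`.

    Computes 1 - exp(-exp(-z)) with z = (score - mu) / beta, using expm1 so
    that very small P-values are not rounded to zero.
    """
    z = (score - mu) / beta
    if z < -700:  # exp(-z) would overflow; the P-value is 1 to machine precision
        return 1.0
    return -math.expm1(-math.exp(-z))


# Made-up reference scores standing in for the query-vs-reference alignments.
reference_scores = [12.1, 9.8, 15.3, 11.0, 10.4, 13.7, 9.1, 14.2, 10.9, 12.6, 11.8, 16.0]
mu, beta = fit_gumbel_moments(reference_scores)

# P-value of a hypothetical query-vs-subject alignment score.
p = gumbel_p_value(40.0, mu, beta)
```

With this choice, a subject protein whose alignment score lies far in the upper tail of the reference scores receives a small P-value.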

To see the relationship between the P-value and the closeness of the first-ranked protein by MRFsearch to a query protein, we conduct an experiment on the 368 CAMEO target proteins. For each CAMEO target, the first-ranked protein in the database is treated as the homolog of this target. To measure the quality of an alignment, we use the un-normalized Global Distance Test (uGDT). GDT has been employed by CASP as an official measure of protein model quality for many years. When applied to alignments, uGDT can be interpreted as the number of correctly-aligned positions in an alignment, weighted by the alignment quality at each position. We say an alignment is good when its uGDT is larger than 50. We use 50 as the cutoff because many proteins similar at only the fold level have uGDT around 50. Figure 3.2 shows the relationship between P-value and uGDT on the 368 CAMEO targets. Figure 3.3 is a zoom-in graph of Fig. 3.2, showing the relationship between P-value and uGDT on the 132 CAMEO targets with −log(P-value) < 20. As shown in Fig. 3.3, when the P-value is small (i.e. <10e−10), most alignments have uGDT greater than or equal to 50. That is, when the P-value is less than 10e−10, the first-ranked protein is very likely to share a similar fold with the query protein. When the P-value is between 10e−5 and 10e−10, more than half of the alignments have uGDT > 50.
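The rule of thumb above can be written as a small Python helper. This is only a sketch: it reads the two cutoffs as 10^-10 and 10^-5, which is an assumption about the notation "10e−10" and "10e−5" (if those are meant literally, the constants would be 1e-9 and 1e-4 and should be adjusted). The function name, the constant names, and the returned labels are hypothetical.

```python
# ASSUMPTION: "10e-10" and "10e-5" in the text are read as 10^-10 and 10^-5.
STRONG_CUTOFF = 1e-10
WEAK_CUTOFF = 1e-5


def interpret_p_value(p_value: float) -> str:
    """Map the P-value of the first-ranked protein to the guidance in the text."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a P-value must lie between 0 and 1")
    if p_value < STRONG_CUTOFF:
        # Most alignments in this range have uGDT >= 50.
        return "very likely similar fold"
    if p_value <= WEAK_CUTOFF:
        # More than half of the alignments in this range have uGDT > 50.
        return "possibly similar fold"
    # The text gives no guidance for larger P-values.
    return "no guidance"


label = interpret_p_value(3e-12)
```

The labels summarize the trend observed on the CAMEO targets and are not a guarantee for any individual query.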
