Combining Biochemical Features and Evolutionary Information for Predicting DNA-Binding Residues in Protein Sequences - Advances in Computational Science and Engineering


data instance, an RF classifier combines the votes made by the decision trees, and gives the most popular class as the output of the ensemble. It has been shown that RFs outperform AdaBoost ensembles on noisy datasets, and can work well on data with many weak inputs [11]. These characteristics of RFs are appealing since the DNA-binding data appear to be noisy and may contain many weak sequence-derived features.
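The voting step described above can be illustrated with a minimal Python sketch. This is not the Fortran software used in the study; the function name `majority_vote` and the example labels are hypothetical, and the per-tree votes are assumed to be already available as a list.

```python
from collections import Counter


def majority_vote(tree_votes):
    """Return the class label that received the most votes from the trees."""
    if not tree_votes:
        raise ValueError("at least one vote is required")
    # most_common(1) returns [(label, count)] for the top label; labels with
    # equal counts are ordered by first occurrence in the input.
    return Counter(tree_votes).most_common(1)[0][0]


# Hypothetical votes from five decision trees for a single residue.
votes = ["binding", "non-binding", "binding", "binding", "non-binding"]
predicted = majority_vote(votes)
```

Here three of the five hypothetical trees vote "binding", so that label becomes the ensemble output.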

The software available at http://www.stat.berkeley.edu/~breiman/RandomForests/ was used to construct the RF classifiers in this study with the default parameter settings for training. In particular, the number of variables selected to split each node (m) was set to the floor of the square root of the total number of input variables (the default setting). Other values of m were also tested, but did not result in significant improvement of classifier performance.
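As a rough illustration of the default setting for m, the following Python sketch computes the floor of the square root of the number of input variables and draws a random subset of that size as split candidates. The helper names `default_m` and `candidate_variables`, the seed, and the variable count of 120 are assumptions made for this example; they are not taken from the study or from the original software.

```python
import math
import random


def default_m(n_variables):
    """Floor of the square root of the number of input variables."""
    if n_variables < 1:
        raise ValueError("n_variables must be positive")
    # math.isqrt returns the integer floor of the square root.
    return math.isqrt(n_variables)


def candidate_variables(n_variables, m, rng):
    """Draw m distinct variable indices to consider when splitting one node."""
    return rng.sample(range(n_variables), m)


n_variables = 120  # hypothetical number of input variables
m = default_m(n_variables)
rng = random.Random(0)  # fixed seed so the illustration is repeatable
candidates = candidate_variables(n_variables, m, rng)
```

With 120 hypothetical variables, m is 10, so each node would consider 10 randomly chosen variables rather than all 120.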

2.5 Classifier Evaluation

A fivefold cross-validation approach was used to provide the initial estimation of classifier performance on the PDNA-62 dataset. The trained classifier was further evaluated using the PDC25t dataset. The following performance measures were used in this work:

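$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

$$\text{Sensitivity} = \frac{TP}{TP + FN} \tag{5}$$

$$\text{Specificity} = \frac{TN}{TN + FP} \tag{6}$$

$$\text{Strength} = \frac{\text{Sensitivity} + \text{Specificity}}{2} \tag{7}$$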

where TP is the number of true positives (binding residues with positive predictions); TN is the number of true negatives (non-binding residues with negative predictions); FP is the number of false positives (non-binding residues but predicted as binding sites); and FN is the number of false negatives (binding residues but predicted as non-binding sites). Since the datasets used in this study are imbalanced, the overall accuracy alone could be misleading. Thus, both sensitivity and specificity are also computed from prediction results. Furthermore, the average of sensitivity and specificity, referred to as strength in this paper, may provide a fair measure of classifier performance as shown in previous studies [4,9].
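The four measures can be computed directly from the confusion counts. The sketch below is a minimal Python illustration; the function name `classification_measures` and the counts are hypothetical and are not results from this study. The example shows why accuracy alone can mislead on imbalanced data: a classifier that labels every residue as non-binding still reaches high accuracy, while its strength stays at 0.5.

```python
def classification_measures(tp, tn, fp, fn):
    """Compute accuracy, sensitivity, specificity and strength from counts."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must be present in the evaluated data")
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)  # fraction of binding residues recovered
    specificity = tn / (tn + fp)  # fraction of non-binding residues recovered
    strength = (sensitivity + specificity) / 2
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "strength": strength,
    }


# Hypothetical imbalanced set: 100 binding and 900 non-binding residues,
# with every residue predicted as non-binding.
all_negative = classification_measures(tp=0, tn=900, fp=0, fn=100)

# Hypothetical classifier with a more balanced tradeoff on the same set.
balanced = classification_measures(tp=60, tn=720, fp=180, fn=40)
```

In this made-up case the all-negative predictor has accuracy 0.9 but strength 0.5, whereas the second classifier has lower accuracy (0.78) and higher strength (0.7).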

The Matthews Correlation Coefficient (MCC) is commonly used as a measure of the quality of binary classifications [17]. It measures the correlation between predictions and the actual class labels. However, for imbalanced datasets, different tradeoffs of sensitivity and specificity may give rise to different MCC values for a classifier. MCC is defined as:

$$\text{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}} \tag{8}$$
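A minimal Python sketch of Equation (8) follows. The function name `mcc` and the example counts are hypothetical. Returning 0 when the denominator is zero is a common convention assumed here, not something stated in the text.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion counts."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        # Assumed convention: an undefined MCC is reported as 0.
        return 0.0
    return numerator / denominator


perfect = mcc(tp=50, tn=50, fp=0, fn=0)         # every prediction correct
inverted = mcc(tp=0, tn=0, fp=50, fn=50)        # every prediction wrong
all_negative = mcc(tp=0, tn=900, fp=0, fn=100)  # only one class predicted
```

A perfect classifier gives an MCC of 1, a fully inverted one gives -1, and a predictor that outputs a single class falls into the zero-denominator case.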
