Appendix B
Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression
GAIL GONG*
Given a prediction rule based on a set of patients, what is the probability of
incorrectly predicting the outcome of a new patient? Call this probability the
true error. An optimistic estimate is the apparent error, or the proportion of
incorrect predictions on the original set of patients, and it is the goal of this
article to study estimates of the excess error, or the difference between the
true and apparent errors. I consider three estimates of the excess error:
cross-validation, the jackknife, and the bootstrap. Using simulations and real
data, I compare the three estimates for a specific prediction rule. When
the prediction rule is allowed to be complicated, overfitting becomes a real
danger, and excess error estimation becomes important. The prediction rule
chosen here is moderately complicated, involving a variable-selection
procedure based on forward logistic regression.
KEY WORDS: Prediction; Error rate estimation; Variable selection.
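To make these quantities concrete, here is a minimal Python sketch of the apparent error and of cross-validation and bootstrap estimates of excess error. It is illustrative only, not the paper's procedure: scikit-learn's LogisticRegression stands in for the forward-selection rule studied here, the data are synthetic, the number of bootstrap replications (B = 200) is an arbitrary choice, and the jackknife estimate (a close relative of leave-one-out cross-validation) is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fit_rule(X, y):
    # Stand-in prediction rule: plain logistic regression. The paper's
    # rule adds a forward variable-selection step on top of this.
    return LogisticRegression(max_iter=1000).fit(X, y)

def err(rule, X, y):
    # Proportion of incorrect predictions of `rule` on (X, y); on the
    # training data itself this is the apparent error.
    return float(np.mean(rule.predict(X) != y))

def cv_excess(X, y):
    # Leave-one-out cross-validation estimate of excess error:
    # CV error minus the apparent error of the full-data rule.
    n = len(y)
    loo = [err(fit_rule(np.delete(X, i, 0), np.delete(y, i)),
               X[i:i + 1], y[i:i + 1]) for i in range(n)]
    return float(np.mean(loo)) - err(fit_rule(X, y), X, y)

def boot_excess(X, y, B=200):
    # Bootstrap estimate of excess error: for each resample, the rule's
    # error on the original data minus its apparent error on the
    # resample, averaged over B resamples. Assumes each resample
    # contains both outcome classes (safe for moderate n).
    n = len(y)
    diffs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)
        rule = fit_rule(X[idx], y[idx])
        diffs.append(err(rule, X, y) - err(rule, X[idx], y[idx]))
    return float(np.mean(diffs))

# Toy usage on synthetic data: two informative and two noise predictors.
X = rng.normal(size=(100, 4))
y = (X @ np.array([1.0, -1.0, 0.0, 0.0]) + rng.normal(size=100) > 0).astype(int)
print("apparent error:", err(fit_rule(X, y), X, y))
print("CV excess error estimate:", cv_excess(X, y))
print("bootstrap excess error estimate:", boot_excess(X, y))
```

Adding either excess-error estimate to the apparent error gives a less optimistic estimate of the true error; the simpler the stand-in rule, the smaller the correction, which is why the complicated, overfitting-prone rule below makes the estimation problem interesting.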
1. INTRODUCTION
A common goal in medical studies is prediction. Suppose we observe $n$
patients, $x_1 = (t_1, y_1), \ldots, x_n = (t_n, y_n)$, where $y_i$ is a binary variable indicating
whether or not the $i$th patient dies of chronic hepatitis and $t_i$ is a
vector of explanatory variables describing various medical measurements
* Gail Gong is Assistant Professor, Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA 15213.
Reprinted with permission of the American Statistical Association.