Appendix B
Cross-Validation, the Jackknife, and the Bootstrap: Excess Error Estimation in Forward Logistic Regression
GAIL GONG*
Given a prediction rule based on a set of patients, what is the probability of
incorrectly predicting the outcome of a new patient? Call this probability the
true error. An optimistic estimate is the apparent error, or the proportion of
incorrect predictions on the original set of patients, and it is the goal of this
article to study estimates of the excess error, or the difference between the
true and apparent errors. I consider three estimates of the excess error:
cross-validation, the jackknife, and the bootstrap. Using simulations and real
data, I compare the three estimates for a specific prediction rule. When
the prediction rule is allowed to be complicated, overfitting becomes a real
danger, and excess error estimation becomes important. The prediction rule
chosen here is moderately complicated, involving a variable-selection
procedure based on forward logistic regression.
KEY WORDS: Prediction; Error rate estimation; Variable selection.
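To make these quantities concrete, here is a minimal Python sketch of the apparent error and of cross-validation and bootstrap estimates of excess error. It is illustrative only, not the paper's procedure: scikit-learn's LogisticRegression stands in for the forward-selection rule studied here, the data are synthetic, the number of bootstrap replications (B = 200) is an arbitrary choice, and the jackknife estimate (a close relative of leave-one-out cross-validation) is omitted.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def fit_rule(X, y):
    # Stand-in prediction rule: plain logistic regression. The paper's
    # rule adds a forward variable-selection step on top of this.
    return LogisticRegression(max_iter=1000).fit(X, y)

def err(rule, X, y):
    # Proportion of incorrect predictions of `rule` on (X, y); on the
    # training data itself this is the apparent error.
    return float(np.mean(rule.predict(X) != y))

def cv_excess(X, y):
    # Leave-one-out cross-validation estimate of excess error:
    # CV error minus the apparent error of the full-data rule.
    n = len(y)
    loo = [err(fit_rule(np.delete(X, i, 0), np.delete(y, i)),
               X[i:i + 1], y[i:i + 1]) for i in range(n)]
    return float(np.mean(loo)) - err(fit_rule(X, y), X, y)

def boot_excess(X, y, B=200):
    # Bootstrap estimate of excess error: for each resample, the rule's
    # error on the original data minus its apparent error on the
    # resample, averaged over B resamples. Assumes each resample
    # contains both outcome classes (safe for moderate n).
    n = len(y)
    diffs = []
    for _ in range(B):
        idx = rng.integers(0, n, n)
        rule = fit_rule(X[idx], y[idx])
        diffs.append(err(rule, X, y) - err(rule, X[idx], y[idx]))
    return float(np.mean(diffs))

# Toy usage on synthetic data: two informative and two noise predictors.
X = rng.normal(size=(100, 4))
y = (X @ np.array([1.0, -1.0, 0.0, 0.0]) + rng.normal(size=100) > 0).astype(int)
print("apparent error:", err(fit_rule(X, y), X, y))
print("CV excess error estimate:", cv_excess(X, y))
print("bootstrap excess error estimate:", boot_excess(X, y))
```

Adding either excess-error estimate to the apparent error gives a less optimistic estimate of the true error; the simpler the stand-in rule, the smaller the correction, which is why the complicated, overfitting-prone rule below makes the estimation problem interesting.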
1. INTRODUCTION
A common goal in medical studies is prediction. Suppose we observe $n$
patients, $x_1 = (t_1, y_1), \ldots, x_n = (t_n, y_n)$, where $y_i$ is a binary variable indicating
whether or not the $i$th patient dies of chronic hepatitis and $t_i$ is a
vector of explanatory variables describing various medical measurements
* Gail Gong is Assistant Professor, Department of Statistics, Carnegie-Mellon University, Pittsburgh, PA 15213.
Reprinted with permission of the American Statistical Association.