on the i th patient. These n patients are called the training sample. We
apply a prediction rule h to the training sample x = ( x 1 ,..., x n ) to form
the realized prediction rule h x . Given a new patient whose medical mea-
surements are summarized by the vector t 0 , we predict whether or not he
will die of chronic hepatitis by h x ( t 0 ), which takes on values “death” or
“not death.” Allowing the prediction rule to be complicated, perhaps
including transforming and choosing from many variables and estimating
parameters, we want to know: What is the error rate, or the probability of
predicting a future observation incorrectly?
A possible estimate of the error rate is the proportion of errors that h x
makes when applied to the original observations x 1 ,..., x n . Because the
same observations are used for both forming and assessing the prediction
rule, this proportion, which I call the apparent error, underestimates the
error rate.
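A minimal sketch of the apparent error, using a stand-in prediction rule (a least-squares linear classifier on simulated data, not the paper's forward logistic regression; all names and data here are hypothetical):

```python
import numpy as np

# Hypothetical training sample (t_i, y_i): p measurements and a 0/1 response.
rng = np.random.default_rng(0)
n, p = 50, 3
t = rng.normal(size=(n, p))                               # t_1, ..., t_n
y = (t[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)  # response y_i

def fit_rule(t, y):
    """Least-squares linear rule standing in for a real prediction rule h."""
    X = np.column_stack([np.ones(len(t)), t])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda t0: (np.column_stack([np.ones(len(t0)), t0]) @ beta > 0.5).astype(int)

h_x = fit_rule(t, y)   # the realized prediction rule h_x

# Apparent error: proportion of errors h_x makes on the very data used to fit it.
apparent_error = np.mean(h_x(t) != y)
print(apparent_error)
```

Because the same observations both form and assess the rule, this proportion is optimistically biased downward relative to the true error rate.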
To correct for this bias, we might use cross-validation, the jackknife, or
the bootstrap for estimating excess errors (e.g., see Efron 1982). We study
the performance of these three methods for a specific prediction rule.
Excess error estimation is especially important when the training sample is
small relative to the number of parameters requiring estimation, because
the apparent error can be seriously biased. In the chronic hepatitis
example, if the dimension of t i is large relative to n , we might use a pre-
diction rule that selects a subset of the variables that we hope are strong
predictors. Specifically, I will consider a prediction rule based on forward
logistic regression. I apply this prediction rule to some chronic hepatitis
data collected at Stanford Hospital and to some simulated data. In the
simulated data, I compare the performance of the three methods and find
that cross-validation and the jackknife do not offer significant improve-
ment over the apparent error, whereas the improvement given by the
bootstrap is substantial.
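As a concrete illustration of excess error estimation, the following sketch computes a leave-one-out cross-validation estimate in the spirit of Efron (1982). The prediction rule here is a simple nearest-centroid classifier on simulated data, a stand-in for the forward logistic regression rule studied in the paper:

```python
import numpy as np

# Hypothetical data: two measurements, response determined by the first.
rng = np.random.default_rng(1)
n, p = 40, 2
t = rng.normal(size=(n, p))
y = (t[:, 0] > 0).astype(int)

def fit_rule(t, y):
    """Nearest-centroid rule: predict the class with the nearer mean."""
    c0, c1 = t[y == 0].mean(axis=0), t[y == 1].mean(axis=0)
    return lambda t0: (np.linalg.norm(t0 - c1, axis=1)
                       < np.linalg.norm(t0 - c0, axis=1)).astype(int)

h_x = fit_rule(t, y)
apparent = np.mean(h_x(t) != y)          # apparent error

# Leave-one-out cross-validation: refit on n-1 points, score the held-out one.
errs = []
for i in range(n):
    keep = np.arange(n) != i
    h_i = fit_rule(t[keep], y[keep])
    errs.append(h_i(t[i:i+1])[0] != y[i])
cv_error = np.mean(errs)

# Cross-validation estimate of the excess error (true error - apparent error).
excess_cv = cv_error - apparent
print(apparent, cv_error, excess_cv)
```

Adding the estimated excess error to the apparent error corrects its optimistic bias; the jackknife and bootstrap produce analogous corrections by different resampling schemes.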
A review of required definitions appears in Section 2. In Section 3, I
discuss a prediction rule based on forward logistic regression and apply it
to the chronic hepatitis data. In Sections 4 and 5, I apply the rule to
simulated data. Section 6 concludes.
2. DEFINITIONS
I briefly review the definitions that will be used in later discussions. These
definitions are essentially those given by Efron (1982). Let x 1 = ( t 1 , y 1 ),...,
x n = ( t n , y n ) be independent and identically distributed from an unknown
distribution F , where t i is a p -dimensional row vector of real-valued
explanatory variables and y i is a real-valued response. Let F̂ be the empirical
distribution function that puts mass 1/ n at each point x 1 ,..., x n . We apply
a prediction rule h to this training sample and form the realized prediction
rule h x .
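A minimal sketch of the empirical distribution F̂, using made-up observations: F̂ puts mass 1/ n at each observed point, so drawing from F̂ amounts to sampling the observations with replacement, which is the mechanism underlying the bootstrap.

```python
import numpy as np

# Hypothetical observations x_1, ..., x_n.
x = np.array([0.2, 1.5, -0.3, 0.8])
n = len(x)

# F-hat evaluated at a point s: the fraction of observations <= s,
# i.e., total mass 1/n placed at each x_i that lies at or below s.
F_hat = lambda s: np.mean(x <= s)

# A draw from F-hat is a resample of the data with replacement.
rng = np.random.default_rng(2)
boot = x[rng.integers(0, n, size=n)]

print(F_hat(0.5))  # -> 0.5, since two of the four points are <= 0.5
```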