b) If X_i and X_j are both categorical variables, use their value-pairs to divide the sample space. For example, if X_i and X_j take c_i and c_j values, respectively, the chi-squared statistic and p-value are computed from a table with two rows and number of columns equal to c_i c_j less the number of columns with zero totals.
c) If X_i is ordered and X_j is categorical, divide the X_i-space into two at the sample median and the X_j-space into as many sets as the number of categories in its range. If X_j has c categories, this splits the (X_i, X_j)-space into 2c subsets. Construct a 2 x 2c contingency table with the signs of the residuals as rows and the 2c subsets as columns, then compute the chi-squared statistic and its p-value after dropping any columns with zero totals (a code sketch of this test follows the list).

Let p_(i) denote the smallest p-value and let X_(i) and X_(i)' denote the pair of variables associated with p_(i).
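For concreteness, the test in step (c) can be sketched in a few lines. This is a minimal illustration, assuming residuals from the node's fitted model are already available; the function name and the scipy-based implementation are assumptions for illustration, not part of GUIDE itself:

```python
# A minimal sketch of the interaction test in step (c); the function name and
# the scipy-based implementation are illustrative, not GUIDE's actual code.
import numpy as np
from scipy.stats import chi2_contingency

def interaction_pvalue_ordered_categorical(x_ord, x_cat, residuals):
    """Chi-squared p-value for the interaction between an ordered and a
    categorical predictor, based on the signs of the residuals."""
    half = x_ord > np.median(x_ord)          # split the ordered variable at its median
    cats = np.unique(x_cat)
    # Each (median half, category) pair is one column; the residual sign gives the row.
    table = np.zeros((2, 2 * len(cats)))
    for j, c in enumerate(cats):
        for h in (0, 1):
            cell = (x_cat == c) & (half == h)
            table[0, 2 * j + h] = np.sum(residuals[cell] >= 0)
            table[1, 2 * j + h] = np.sum(residuals[cell] < 0)
    table = table[:, table.sum(axis=0) > 0]  # drop columns with zero totals
    _, pval, _, _ = chi2_contingency(table)
    return pval
```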
After the algorithm terminates, we prune the tree with the method described in Breiman et al. (1984, Sect. 8.5) using V-fold cross-validation. Let E be the smallest cross-validation estimate of prediction mean squared error and let α be a positive number. We select the smallest subtree whose cross-validation estimate of mean squared error is within α times the standard error of E. To prevent large prediction errors caused by extrapolation, we also truncate all predicted values so that they lie within the range of the data values in their respective nodes. The examples here employ the default values of V = 10 and α = 0.5; we call this the half-SE rule.
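The half-SE selection rule can be written down directly from this description. The following is a minimal sketch; it assumes the cross-validation estimates and their standard errors have already been computed for each pruned subtree, and all names are hypothetical:

```python
# A minimal sketch of the half-SE rule described above; it assumes a list of
# pruned subtrees with precomputed V-fold CV estimates. Names are illustrative.
def select_subtree(subtrees, alpha=0.5):
    """subtrees: list of (n_terminal_nodes, cv_mse, cv_se) tuples.
    Returns the smallest subtree whose CV MSE is within alpha standard
    errors of the smallest CV MSE estimate E."""
    e_mse, e_se = min((mse, se) for _, mse, se in subtrees)
    threshold = e_mse + alpha * e_se
    eligible = [t for t in subtrees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])  # fewest terminal nodes wins
```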
Our split-selection approach is different from that of CART, which constructs piecewise constant models only and which searches for the best variable to split and the best split point simultaneously at each node. This requires the evaluation of all possible splits on every predictor variable. Thus, if there are K ordered predictor variables each taking M distinct values at a node, K(M - 1) splits have to be evaluated. To extend the CART approach to piecewise linear regression, two linear models must be fitted for each candidate split. This means that 2K(M - 1) regression models must be computed before a split is found. The corresponding number of regression models for K categorical predictors each having M distinct values is 2K(2^(M-1) - 1), since a categorical variable with M values admits 2^(M-1) - 1 binary splits. GUIDE, in contrast, only fits regression models to variables associated with the most significant curvature or interaction test. Thus the computational savings can be substantial. More important than computation, however, is that CART's variable selection is inherently biased toward choosing variables that permit more splits. For example, if two ordered variables are both independent of the response variable, the one with more unique values has a higher chance of being selected by CART. GUIDE does not have such bias because it uses p-values for variable selection.
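To make the comparison concrete, the following back-of-the-envelope computation evaluates the model counts above; the values of K and M are made up for illustration:

```python
# Counts of regression models a CART-style exhaustive search would fit at one
# node, using the formulas above; K and M are made-up example values.
K, M = 10, 50                     # 10 ordered predictors, 50 distinct values each
ordered_models = 2 * K * (M - 1)  # two linear models per candidate split
print(ordered_models)             # 980

K, M = 10, 12                     # 10 categorical predictors, 12 categories each
categorical_models = 2 * K * (2 ** (M - 1) - 1)
print(categorical_models)         # 40940
```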
7.4 Mussels - Categorical Predictors and SIR
In this section, we use GUIDE to reanalyze a dataset, previously studied by Cook (1998), to show that GUIDE can deal with categorical predictor variables as natu-