b) If X_i and X_j are both categorical variables, use their value-pairs to divide the sample space. For example, if X_i and X_j take c_i and c_j values, respectively, the chi-squared statistic and p-value are computed from a table with two rows and number of columns equal to c_i c_j less the number of columns with zero totals.
c) If X_i is ordered and X_j is categorical, divide the X_i-space into two at the sample median and the X_j-space into as many sets as the number of categories in its range. If X_j has c categories, this splits the (X_i, X_j)-space into 2c subsets. Construct a 2 x 2c contingency table with the signs of the residuals as rows and the 2c subsets as columns, then compute the chi-squared statistic and its p-value after dropping any columns with zero totals (a code sketch of this test follows the list).

Let p_(i) denote the smallest p-value and let X_(i) and X_(i)' denote the pair of variables associated with p_(i).
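For concreteness, the test in step (c) can be sketched in a few lines. This is a minimal illustration, assuming residuals from the node's fitted model are already available; the function name and the scipy-based implementation are assumptions for illustration, not part of GUIDE itself:

```python
# A minimal sketch of the interaction test in step (c); the function name and
# the scipy-based implementation are illustrative, not GUIDE's actual code.
import numpy as np
from scipy.stats import chi2_contingency

def interaction_pvalue_ordered_categorical(x_ord, x_cat, residuals):
    """Chi-squared p-value for the interaction between an ordered and a
    categorical predictor, based on the signs of the residuals."""
    half = x_ord > np.median(x_ord)          # split the ordered variable at its median
    cats = np.unique(x_cat)
    # Each (median half, category) pair is one column; the residual sign gives the row.
    table = np.zeros((2, 2 * len(cats)))
    for j, c in enumerate(cats):
        for h in (0, 1):
            cell = (x_cat == c) & (half == h)
            table[0, 2 * j + h] = np.sum(residuals[cell] >= 0)
            table[1, 2 * j + h] = np.sum(residuals[cell] < 0)
    table = table[:, table.sum(axis=0) > 0]  # drop columns with zero totals
    _, pval, _, _ = chi2_contingency(table)
    return pval
```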
After the algorithm terminates, we prune the tree with the method described in Breiman et al. (1984, Sect. 8.5) using V-fold cross-validation. Let E be the smallest cross-validation estimate of prediction mean squared error and let α be a positive number. We select the smallest subtree whose cross-validation estimate of mean squared error is within α times the standard error of E. To prevent large prediction errors caused by extrapolation, we also truncate all predicted values so that they lie within the range of the data values in their respective nodes. The examples here employ the default values of V = 10 and α = 0.5; we call this the half-SE rule.
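The half-SE selection rule can be written down directly from this description. The following is a minimal sketch; it assumes the cross-validation estimates and their standard errors have already been computed for each pruned subtree, and all names are hypothetical:

```python
# A minimal sketch of the half-SE rule described above; it assumes a list of
# pruned subtrees with precomputed V-fold CV estimates. Names are illustrative.
def select_subtree(subtrees, alpha=0.5):
    """subtrees: list of (n_terminal_nodes, cv_mse, cv_se) tuples.
    Returns the smallest subtree whose CV MSE is within alpha standard
    errors of the smallest CV MSE estimate E."""
    e_mse, e_se = min((mse, se) for _, mse, se in subtrees)
    threshold = e_mse + alpha * e_se
    eligible = [t for t in subtrees if t[1] <= threshold]
    return min(eligible, key=lambda t: t[0])  # fewest terminal nodes wins
```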
Our split-selection approach is different from that of CART, which constructs piecewise constant models only and which searches for the best variable to split and the best split point simultaneously at each node. This requires the evaluation of all possible splits on every predictor variable. Thus, if there are K ordered predictor variables each taking M distinct values at a node, K(M - 1) splits have to be evaluated. To extend the CART approach to piecewise linear regression, two linear models must be fitted for each candidate split. This means that 2K(M - 1) regression models must be computed before a split is found. The corresponding number of regression models for K categorical predictors each having M distinct values is 2K(2^(M-1) - 1), since a categorical variable with M values admits 2^(M-1) - 1 binary splits. GUIDE, in contrast, only fits regression models to variables associated with the most significant curvature or interaction test. Thus the computational savings can be substantial. More important than computation, however, is that CART's variable selection is inherently biased toward choosing variables that permit more splits. For example, if two ordered variables are both independent of the response variable, the one with more unique values has a higher chance of being selected by CART. GUIDE does not have such bias because it uses p-values for variable selection.
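To make the comparison concrete, the following back-of-the-envelope computation evaluates the model counts above; the values of K and M are made up for illustration:

```python
# Counts of regression models a CART-style exhaustive search would fit at one
# node, using the formulas above; K and M are made-up example values.
K, M = 10, 50                     # 10 ordered predictors, 50 distinct values each
ordered_models = 2 * K * (M - 1)  # two linear models per candidate split
print(ordered_models)             # 980

K, M = 10, 12                     # 10 categorical predictors, 12 categories each
categorical_models = 2 * K * (2 ** (M - 1) - 1)
print(categorical_models)         # 40940
```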
7.4 Mussels - Categorical Predictors and SIR
In this section, we use GUIDE to reanalyze a dataset, previously studied by Cook (1998), to show that GUIDE can deal with categorical predictor variables as natu-