els produced by most regression techniques, including the most basic ones, are often
difficult or impossible to interpret. Besides, even when a model is mathematically
interpretable, the conclusions can be far from unambiguous.
In the rest of this article, we use four examples to highlight some common difficulties: (i) effects of collinearity on modeling Boston housing prices (Sect. 7.2), (ii) inclusion of a categorical predictor variable in modeling New Zealand horse mussels (Sect. 7.3), (iii) outlier detection amid widespread confounding in US automobile crash tests (Sect. 7.4), and (iv) Poisson regression modeling of Swedish car insurance rates (Sect. 7.5). We propose a divide-and-conquer strategy to solve these problems. It is based on partitioning the dataset into naturally interpretable subsets such that a relatively simple and visualizable regression model can be fitted to each subset. A critical requirement is that the partitions be free of selection bias. Otherwise, inferences drawn from the partitions may be incorrect. Another requirement is that the solution be capable of determining the number and type of partitions by itself. In Sect. 7.2 we present an implementation derived from the GUIDE regression tree algorithm (Loh, 2002). At the time of this writing, GUIDE is the only algorithm that has the above properties as well as other desirable features.
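To make the divide-and-conquer idea concrete, here is a minimal sketch (not the GUIDE algorithm itself, whose unbiased split selection is the hard part) of fitting a separate, easily interpretable least-squares line to each subset of a partition. The data and the split point are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data whose slope changes at x = 0: one global line fits poorly,
# but a simple model on each partition is accurate and easy to interpret.
x = rng.uniform(-2, 2, 200)
y = np.where(x < 0, 1 - 2 * x, 1 + 3 * x) + rng.normal(0, 0.1, 200)

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Partition the data at the (here, known in advance) split point
# and fit one simple regression model per subset.
left, right = x < 0, x >= 0
a_l, b_l = fit_line(x[left], y[left])
a_r, b_r = fit_line(x[right], y[right])
print(b_l, b_r)  # slopes close to -2 and 3
```

A real tree algorithm must instead search for the split variable and split point, which is precisely where selection bias can creep in if the search is done naively.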
7.2 Boston Housing Data – Effects of Collinearity
The well-known Boston housing dataset was collected by Harrison and Rubinfeld (1978) to study the effect of air pollution on real estate prices in the greater Boston area in the 1970s. Belsley et al. (1980) drew attention to the data when they used it to illustrate regression diagnostic techniques. The data consist of 506 observations on 14 variables, with each observation pertaining to one census tract. Table . gives the names and definitions of the variables. We use the version of the data that incorporates the minor corrections found by Gilley and Pace (1996).
Harrison and Rubinfeld (1978) fitted the linear model
\[
\log(\text{MEDV}) = \beta_0 + \beta_1\,\text{CRIM} + \beta_2\,\text{ZN} + \beta_3\,\text{INDUS} + \beta_4\,\text{CHAS} + \beta_5\,\text{NOX}^2 + \beta_6\,\text{RM}^2 + \beta_7\,\text{AGE} + \beta_8 \log(\text{DIS}) + \beta_9 \log(\text{RAD}) + \beta_{10}\,\text{TAX} + \beta_{11}\,\text{PT} + \beta_{12}\,\text{B} + \beta_{13} \log(\text{STAT})
\]
whose least-squares estimates, t-statistics, and marginal correlation between each regressor and log(MEDV) are given in Table . . Note the liberal use of the square and log transformations. Although many of the signs of the coefficient estimates are reasonable and expected, those of log(DIS) and log(RAD) are somewhat surprising, because their signs contradict those of their respective marginal correlations with the response variable. For example, the regression coefficient of log(DIS) is negative, but the plot in Fig. . shows a positive slope.
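This kind of sign reversal is a textbook consequence of collinearity, and it can be reproduced in miniature without the Boston data. In the sketch below (the variables and coefficients are invented for illustration), two regressors are strongly correlated, so one of them inherits the other's marginal association with the response even though its own partial effect has the opposite sign:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# x1 and x2 are strongly collinear; y depends positively on x1
# but negatively on x2.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)
y = 2 * x1 - x2 + rng.normal(scale=0.5, size=n)

# Marginal correlation of x2 with y is positive, because x2 is a near-copy
# of x1 and so inherits x1's positive effect ...
marginal = np.corrcoef(x2, y)[0, 1]

# ... but the least-squares coefficient of x2 in the joint model, which
# measures its effect with x1 held fixed, is negative.
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(marginal, beta[2])
```

The marginal correlation answers "how does y vary with x2 ignoring everything else?", while the regression coefficient answers "how does y vary with x2 holding x1 fixed?" — two different questions that need not agree in sign, which is exactly the ambiguity the log(DIS) example illustrates.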