els produced by most regression techniques, including the most basic ones, are often
difficult or impossible to interpret. Besides, even when a model is mathematically
interpretable, the conclusions can be far from unambiguous.
In the rest of this article, we use four examples to highlight some common difficulties: (i) effects of collinearity on modeling Boston housing prices (Sect. 7.2), (ii) inclusion of a categorical predictor variable in modeling New Zealand horse mussels (Sect. 7.3), (iii) outlier detection amid widespread confounding in US automobile crash tests (Sect. 7.4), and (iv) Poisson regression modeling of Swedish car insurance rates (Sect. 7.5). We propose a divide-and-conquer strategy to solve these problems. It is based on partitioning the dataset into naturally interpretable subsets such that a relatively simple and visualizable regression model can be fitted to each subset. A critical requirement is that the partitions be free of selection bias. Otherwise, inferences drawn from the partitions may be incorrect. Another requirement is that the solution be capable of determining the number and type of partitions by itself. In Sect. 7.2 we present an implementation derived from the GUIDE regression tree algorithm (Loh, 2002). At the time of this writing, GUIDE is the only algorithm that has the above properties as well as other desirable features.
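To make the divide-and-conquer idea concrete, here is a minimal sketch (not the GUIDE algorithm itself, whose unbiased split selection is the hard part) of fitting a separate, easily interpretable least-squares line to each subset of a partition. The data and the split point are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data whose slope changes at x = 0: one global line fits poorly,
# but a simple model on each partition is accurate and easy to interpret.
x = rng.uniform(-2, 2, 200)
y = np.where(x < 0, 1 - 2 * x, 1 + 3 * x) + rng.normal(0, 0.1, 200)

def fit_line(x, y):
    """Least-squares intercept and slope for y = a + b*x."""
    X = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

# Partition the data at the (here, known in advance) split point
# and fit one simple regression model per subset.
left, right = x < 0, x >= 0
a_l, b_l = fit_line(x[left], y[left])
a_r, b_r = fit_line(x[right], y[right])
print(b_l, b_r)  # slopes close to -2 and 3
```

A real tree algorithm must instead search for the split variable and split point, which is precisely where selection bias can creep in if the search is done naively.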
7.2 Boston Housing Data – Effects of Collinearity
The well-known Boston housing dataset was collected by Harrison and Rubinfeld (1978) to study the effect of air pollution on real estate prices in the greater Boston area in the 1970s. Belsley et al. (1980) drew attention to the data when they used it to illustrate regression diagnostic techniques. The data consist of 506 observations on 14 variables, with each observation pertaining to one census tract. Table . gives the names and definitions of the variables. We use the version of the data that incorporates the minor corrections found by Gilley and Pace (1996).
Harrison and Rubinfeld (1978) fitted the linear model
\[
\log(\text{MEDV}) = \beta_0 + \beta_1\,\text{CRIM} + \beta_2\,\text{ZN} + \beta_3\,\text{INDUS} + \beta_4\,\text{CHAS} + \beta_5\,\text{NOX}^2 + \beta_6\,\text{RM}^2 + \beta_7\,\text{AGE} + \beta_8 \log(\text{DIS}) + \beta_9 \log(\text{RAD}) + \beta_{10}\,\text{TAX} + \beta_{11}\,\text{PT} + \beta_{12}\,\text{B} + \beta_{13} \log(\text{STAT})
\]
whose least-squares estimates, t-statistics, and marginal correlation between each regressor and log(MEDV) are given in Table . . Note the liberal use of the square and log transformations. Although many of the signs of the coefficient estimates are reasonable and expected, those of log(DIS) and log(RAD) are somewhat surprising, because their signs contradict those of their respective marginal correlations with the response variable. For example, the regression coefficient of log(DIS) is negative, but the plot in Fig. . shows a positive slope.
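This kind of sign reversal is a textbook consequence of collinearity, and it can be reproduced in miniature without the Boston data. In the sketch below (the variables and coefficients are invented for illustration), two regressors are strongly correlated, so one of them inherits the other's marginal association with the response even though its own partial effect has the opposite sign:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# x1 and x2 are strongly collinear; y depends positively on x1
# but negatively on x2.
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.3, size=n)
y = 2 * x1 - x2 + rng.normal(scale=0.5, size=n)

# Marginal correlation of x2 with y is positive, because x2 is a near-copy
# of x1 and so inherits x1's positive effect ...
marginal = np.corrcoef(x2, y)[0, 1]

# ... but the least-squares coefficient of x2 in the joint model, which
# measures its effect with x1 held fixed, is negative.
X = np.column_stack([np.ones(n), x1, x2])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(marginal, beta[2])
```

The marginal correlation answers "how does y vary with x2 ignoring everything else?", while the regression coefficient answers "how does y vary with x2 holding x1 fixed?" — two different questions that need not agree in sign, which is exactly the ambiguity the log(DIS) example illustrates.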