A classic example of this might be if we were predicting a person's weight (Y),
and two of the variables were the person's height and his/her pant length. Clearly,
each of these variables is a significant predictor of weight; nobody can deny that on
average, if a person is taller, he/she weighs more. If we assume that these two X vari-
ables are 99% correlated (to the authors, a reasonable assumption, although we've
never done an actual study!!), the multiple regression results would find each of these
variables not significant! That is because, given that each of the two variables (height,
pant length) is telling us the same thing about a person's weight, neither variable
provides unique (i.e., "above and beyond the other variables") predictive value, and,
statistically, the result is the correct one.
Obviously, what we really want is to retain one of the two variables in our pre-
dictive equation, but we do not need both variables. If you remove both variables
from the equation, you would be harming yourself with respect to getting the best
prediction of a person's weight that you can. In fact, if these were the only two
variables under consideration, and you drop them both, you would have nothing!!
Stepwise regression deals with this issue and would keep one of these two vari-
ables, whichever one was the tiniest bit more predictive than the other, and would bar
the other variable from being in the equation. The one variable of the two that is in
the equation is clearly significant, both statistically and intuitively.
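The height/pant-length phenomenon is easy to reproduce with simulated data. The sketch below (the variable names, sample size, and coefficient values are our own illustrative assumptions, not from the text) builds two predictors that are about 99% correlated, then compares the t-statistics each one earns alone versus in a joint regression:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60

# Height in cm; pant length is nearly a linear function of height,
# which makes the two predictors correlate at roughly 0.99.
height = rng.normal(170, 10, n)
pant = 0.45 * height + rng.normal(0, 0.7, n)

# Weight truly depends on height only, plus noise.
weight = -60 + 0.75 * height + rng.normal(0, 8, n)

def ols_t_stats(X, y):
    """Coefficients and t-statistics for OLS with an intercept."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = resid @ resid / (len(y) - X.shape[1])
    se = np.sqrt(np.diag(sigma2 * np.linalg.inv(X.T @ X)))
    return beta, beta / se

# Each predictor alone is strongly significant...
_, t_height_alone = ols_t_stats(height[:, None], weight)
_, t_pant_alone = ols_t_stats(pant[:, None], weight)

# ...but together, neither provides much unique information,
# so both t-statistics shrink dramatically.
_, t_joint = ols_t_stats(np.column_stack([height, pant]), weight)

print(np.corrcoef(height, pant)[0, 1])        # roughly 0.99 by construction
print(t_height_alone[1], t_pant_alone[1])     # large |t| individually
print(t_joint[1], t_joint[2])                 # much smaller |t| jointly
```

The shrinkage is exactly the variance-inflation effect described above: with a correlation of r between the two X's, each coefficient's standard error is inflated by roughly a factor of 1/√(1 − r²).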
10.6.1 HOW DOES STEPWISE REGRESSION WORK?
As silly as it sounds to say, we are saying it: stepwise regression works in steps!
It picks variables one at a time to enter into the equation. The entire process is
automated by the software.
The first step is for the software to run a simple regression with Y and each of
the X's available. These regressions are run "internally"—you do not see (nor wish
to see!) that output; the software picks the variable with the highest r² value. Then it
displays (as you'll see) the results of this one ("winning") regression.
In step 2, stepwise regression runs (internally) a bunch of new regressions; each
regression contains two X's, one being the winner from step 1, and every other X.
So, for example, if there are six X's to begin with (X1, X2, X3, X4, X5, X6), step 1
involves six simple regressions. Now let's assume that X3 has the highest r², say,
0.35, and is, thus, considered the "winner." In step 2, five regressions would be run;
they would involve two X's each, and all would include X3. Ergo, the new step 2
regressions would be Y/(X1 and X3), Y/(X2 and X3), Y/(X4 and X3), Y/(X5 and
X3), and finally, Y/(X6 and X3). Next, which pair of X's together has the highest
r² is identified. Imagine the overall r² with X3 and X6 is the highest, say, 0.59. This
two-variable regression would be displayed on the output.
Onward to step 3. Four regressions are run that contain X3 and X6 and each
other eligible variable (X1, X2, X4, and X5). Again, the highest overall r² of the four
regressions would be identified, and that variable would enter the equation. And so
forth; the process continues.
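The forward pass just described can be sketched in a few lines. The sketch below is our own simplified version: it scores each candidate by overall r², exactly as in the six-X walkthrough, and uses an assumed minimum-improvement stopping rule (real stepwise software typically uses significance tests and may also drop previously entered variables, which we omit here):

```python
import numpy as np

def r_squared(X, y):
    """R-squared of an OLS fit of y on X (intercept included)."""
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - (resid @ resid) / ((y - y.mean()) ** 2).sum()

def forward_stepwise(X, y, min_gain=0.01):
    """Greedy forward selection: each step, try every not-yet-entered
    column alongside the current winners, enter the one giving the
    highest overall R-squared, and stop when the best candidate
    improves R-squared by less than min_gain (an assumed rule)."""
    selected, best_r2 = [], 0.0
    remaining = list(range(X.shape[1]))
    while remaining:
        scores = {j: r_squared(X[:, selected + [j]], y) for j in remaining}
        j_best = max(scores, key=scores.get)
        if scores[j_best] - best_r2 < min_gain:
            break
        selected.append(j_best)
        remaining.remove(j_best)
        best_r2 = scores[j_best]
    return selected, best_r2

# Mirror the text's example: six X's, where (in 0-based terms)
# column 2 plays the role of X3 and column 5 the role of X6.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 2] + X[:, 5] + rng.normal(0, 0.5, 200)

sel, r2 = forward_stepwise(X, y)
print(sel, round(r2, 3))  # column 2 enters first, then column 5
```

Note how step 1 is the six internal simple regressions, step 2 is the five two-variable regressions all containing the step-1 winner, and so on, just as described above.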
This may sound daunting, but don't forget that it is all automated by the software;
one click and you're done! Based on some other features of Stepwise Regression