OK, here's where things really get interesting. If we have a value of X, we can
insert it into the equation for the line, and compute Yc, the value of Y that is predicted
for the value of X we input. For example, if X = 3, we predict that Y is
1.1 + 0.833(3) = 3.599
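As a quick check of this arithmetic, here is a minimal sketch (in Python rather than Excel, purely for illustration) that plugs X = 3 into the fitted line, using the intercept and slope read off Figure 9.19:

```python
# Predicted value Yc from the fitted least-squares line: Yc = 1.1 + 0.833*X
intercept = 1.1
slope = 0.833

def predict(x):
    """Return the predicted value Yc for a given X."""
    return intercept + slope * x

print(predict(3))  # approximately 3.599
```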
But wait, there's more! Check out the correlation coefficient, which is 0.895 (see
solid horizontal arrow in Figure 9.19,³ labeled “Multiple R”). This is a reasonably
high value (and, of course, is the same value we found when we did a correlation
analysis with these same data earlier in the chapter). Loosely, but pragmatically
interpreted, it means we should expect, for the most part, the predicted value of Y and the
actual value of Y to be reasonably close to one another. If we examine the data set, we
see that the average of the (two) Y values when X = 3 is 3.5, which, indeed, is close
to the predicted value, Yc, of 3.599.
If we look right below “Multiple R,” we see “R Square,” which equals 0.801.
As earlier, this indicates that a bit over 80% of the variability in Y (i.e., how come
Y is not always the same!!) is due to the fact that X is not always the same. Indeed,
if X were always the same, the variability in Y would be only about 20% as much
as it is now.
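R Square here is simply the square of Multiple R (0.895² ≈ 0.801); a short sketch (Python, purely for illustration) confirms the arithmetic:

```python
# R Square is the square of the correlation coefficient (Multiple R)
multiple_r = 0.895
r_square = multiple_r ** 2
print(round(r_square, 3))  # 0.801 -> about 80% of the variability in Y is explained by X
```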
In addition to the least-squares line and the correlation coefficient (and its square,
r², the coefficient of determination), there are a few other noteworthy values in the
output of Figure 9.19.
If you look at the bottom right of the output (see vertical arrow), you see a 95%
confidence interval for each of the coefficients (i.e., intercept and slope⁴). Let's take
them one by one.
Our best estimate of the intercept is 1.1; however, a 95% confidence interval for the
true value of the intercept is −1.34 to 3.54. In fact, we can see that the intercept is
not significant, since its p-value is 0.24 (see the bent arrow in Figure 9.19). Therefore,
we cannot rule out that its true value equals zero. Quite often, however, the intercept
is not a quantity that, by itself, is of great value to us.
Now let's look at the confidence interval for the slope. Keep in mind that the
slope is crucially important; whether it's zero or not directly indicates whether the
variables are actually related. Here, we get a value for the slope of 0.833. The 95%
confidence interval for the true slope is 0.071 to 1.596. Its p-value (0.040) is below
the traditional 0.05 benchmark value. Therefore, at significance level equal to 0.05,
the slope is statistically significant.
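For readers who prefer to reproduce this kind of output outside Excel, here is a minimal sketch using Python's statsmodels library. The x and y arrays below are hypothetical placeholders (not the chapter's data set), so the printed coefficients, p-values, and confidence intervals will not match Figure 9.19, but the same quantities appear.

```python
# A sketch of obtaining Figure 9.19-style regression output with statsmodels.
# The x and y arrays are hypothetical placeholders, not the book's data set.
import numpy as np
import statsmodels.api as sm

x = np.array([1, 2, 2, 3, 3, 4], dtype=float)  # placeholder X values
y = np.array([1.5, 3.0, 2.5, 3.4, 3.6, 4.6])   # placeholder Y values

X = sm.add_constant(x)             # add the intercept column
model = sm.OLS(y, X).fit()         # ordinary least-squares fit

print(model.params)                # estimated intercept and slope
print(model.pvalues)               # p-values for intercept and slope
print(model.conf_int(alpha=0.05))  # 95% confidence intervals for each coefficient
print(model.rsquared)              # R Square
```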
³ The reader will note that the correlation is labeled “Multiple R.” This is simply reflecting oversimplification
(sloth?) on Excel's part. Excel did not want to bother writing simple R when there is only one
X, and multiple R when there is more than one X, and decided to just write multiple R no matter how
many X's there are. We obviously weren't involved in the usability testing. ☺
⁴ The reader may note that the confidence intervals for the intercept and for the slope are each written
twice! This, again, is simply reflecting laziness on Excel's part. You can specify a confidence level
other than 95%, and if you do, Excel gives that confidence interval to you, but also, automatically, gives
you the confidence interval for 95%. If you do not specify another confidence level (and one virtually
never does so), Excel gives you the 95% confidence interval as the default and then gives you the
automatic 95% interval again.
 