Can you relate? Correlation and simple linear regression - Improving the User Experience through Practical Data Analytics

So, we now can formally conclude that the two variables are, indeed, linearly related.

SIDEBAR: EXCEL'S WEIRD LABEL FOR THE P-VALUE FOR THE F-STATISTIC

We want to add that the middle section of the output, the ANOVA table (you saw ANOVA tables in several earlier chapters), also gives you a p-value, this one for the F-statistic. You can see the F-statistic value of 12.097 (see curved arrow in Figure 9.19); its p-value is just to the right of it and equals 0.040. But wait a moment! This value is exactly the same as the p-value for the slope! For reasons unknown to the authors, Excel calls the p-value for the F-statistic “Significance F,” but we assure you that this is the p-value (and should be called p-value!). Any time we are running a simple regression (recall: this means there is only one X variable), the F-statistic will have the same p-value as the slope (t-test) and will provide exactly the same information. In fact, in writing up a report on the results of a simple regression, you would not want to discuss the two p-values separately, since doing so would be redundant. In the next chapter, Chapter 10, the p-value for the F-statistic and that for the slope will have different values and will mean different things.
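To see why the two p-values must coincide, it can help to compute both statistics by hand. The sketch below is only an illustration: the five (x, y) pairs are made up (they are not the data behind Figure 9.19), and the variable names are our own. It fits a simple regression and shows that the F-statistic equals the square of the slope's t-statistic.

```python
# Hypothetical data for illustration only (NOT the data behind Figure 9.19).
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.2, 1.9, 3.2, 3.8, 5.1]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Sums of squares and cross-products around the means.
sxx = sum((xi - x_bar) ** 2 for xi in x)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

# Least-squares slope and intercept.
slope = sxy / sxx
intercept = y_bar - slope * x_bar

fitted = [intercept + slope * xi for xi in x]
sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
ssr = sum((fi - y_bar) ** 2 for fi in fitted)           # regression sum of squares

mse = sse / (n - 2)              # mean squared error, n - 2 degrees of freedom
se_slope = (mse / sxx) ** 0.5    # standard error of the slope

t_stat = slope / se_slope        # t-statistic for the slope
f_stat = (ssr / 1) / mse         # F-statistic from the ANOVA table (1 regression df)

print(f"t = {t_stat:.4f}, t squared = {t_stat ** 2:.4f}, F = {f_stat:.4f}")
```

Because an F-statistic with 1 and n − 2 degrees of freedom is distributed as the square of a t-statistic with n − 2 degrees of freedom, equal statistics imply equal p-values. As a quick check against the output discussed here, the square root of 12.097 is about 3.48.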

There is one final thing that we wish to impart about the output in Figure 9.19, and that is the “Standard Error,” as listed in row 7 in the top section of the output (see dashed horizontal arrow in Figure 9.19). Its value equals 0.587, and it is often denoted Sy.x. This is a key value for finding a confidence interval for a prediction, often a very important thing to find. In essence, this is the estimated standard deviation of the error of a prediction if we had the correct regression line. However, we do not have the exact correct regression line (finding which, in theory, would require infinite data!). Still, if the sample size is reasonably large (say, at least 25), and we are predicting for a value of X that is near the mean of our data, we can, as an approximation, use the standard error value as if it were the overall standard deviation of the prediction. With this caveat, the formula for a 95% confidence interval for a prediction is

Yc ± TINV(0.05, n − 2) * Sy.x,

where “n” is the sample size (in this example, n = 5) and TINV is an Excel command that provides a value from the t-distribution. The first value (i.e., 0.05) reflects wanting 95% confidence; it would be 0.01 for 99% confidence, 0.10 for 90% confidence, etc. The second value, (n − 2), is a degrees-of-freedom number. You really don't need to know the details/derivation of why that value is what it is; it is easy to determine, since you know the value of n, the sample size, and hence you know the value of (n − 2). For our earlier example, where we predicted a value of Yc to be 3.599, a 95% confidence interval for what the value will actually come out to be for an individual person is:

3.599 ± TINV(0.05, 3) * (0.587)

3.599 ± (3.182) * (0.587)

3.599 ± 1.865

or

(1.734 to 5),

with the realization that we cannot get a value that exceeds 5.
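The same arithmetic can be sketched outside Excel. The function below is a minimal illustration, not part of the Excel output: its name, its parameters, and the optional clipping to the top of the scale are our own assumptions. It takes the t-value 3.182 directly from the calculation above rather than recomputing it; in Python with SciPy, `scipy.stats.t.ppf(1 - 0.05 / 2, 3)` should give the same two-tailed value that TINV(0.05, 3) returns.

```python
def prediction_interval(y_pred, t_crit, s_yx, lower_bound=None, upper_bound=None):
    """Approximate interval y_pred ± t_crit * s_yx, optionally clipped to scale limits.

    lower_bound / upper_bound are hypothetical parameters for the ends of the
    response scale (for example, a scale that cannot exceed 5).
    """
    margin = t_crit * s_yx
    low = y_pred - margin
    high = y_pred + margin
    if lower_bound is not None:
        low = max(low, lower_bound)
    if upper_bound is not None:
        high = min(high, upper_bound)
    return low, high, margin


# Values taken from the text: Yc = 3.599, TINV(0.05, 3) = 3.182, Sy.x = 0.587.
low, high, margin = prediction_interval(3.599, 3.182, 0.587, upper_bound=5.0)
print(f"margin = {margin:.3f}, interval = ({low:.3f} to {high:.3f})")
```

With these rounded inputs the margin comes out a few thousandths away from the 1.865 shown above, presumably because the spreadsheet carries more decimal places than the printed values.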
