“Sure, how's early afternoon tomorrow?”
“Perfect, I've got my running 1pm meeting with him tomorrow. Maybe the
Chianti he likes during lunch will defuse the inevitable explosion!”
9.7 SUMMARY
In this chapter, we have introduced correlation and regression analysis. Both of these techniques deal with the relationship between a “dependent variable” or output variable that we label “Y,” and an “independent variable” or input variable that we label “X.”
The correlation, r, is a dimensionless quantity that ranges between −1 and +1, and indicates the strength and direction of a linear relationship between the two variables; the (hypothesis) test of its significance is also discussed. We also note that the coefficient of determination, r², has a direct interpretation as the proportion of variability in Y explained by X (in a linear relationship).
We consider example scatter diagrams (graphs of the X, Y points) and discuss
how they correspond with the respective values of r. We also demonstrate in both
Excel and SPSS how to obtain the correlation.
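Although the chapter demonstrates only Excel and SPSS, a minimal sketch in Python (the small data set below is invented for illustration and is not the chapter's data) computes r, the p-value of its significance test, and r²:

import numpy as np
from scipy import stats

# Invented illustrative data (not the chapter's data set)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Pearson correlation r and the two-sided p-value of its significance test
r, p_value = stats.pearsonr(x, y)

# Coefficient of determination: proportion of variability in Y explained by X
r_squared = r ** 2

print(f"r = {r:.4f}, p = {p_value:.4f}, r^2 = {r_squared:.4f}")

Excel's CORREL function and SPSS's bivariate-correlation output should agree with the r above; the p-value corresponds to the significance test mentioned in the summary.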
Regression analysis quantifies the linear relationship between Y and X, by providing a least-squares line from which we can input a value of X and obtain a predicted (best estimate) value of Y, using the line's corresponding slope and intercept. We note how to perform a regression analysis in both Excel and SPSS, and discuss various confidence intervals of interest, as well as hypothesis testing to decide if we should conclude that there truly is a linear relationship between Y and X “beyond a reasonable doubt.” In each case—correlation and regression—our illustrations use a small data set that is easier for the reader to follow, and then we apply the technique to the prototype real-world data from Behemoth.com.
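As a rough Python counterpart to the Excel and SPSS regression output discussed in the chapter (again with invented data; the 95% confidence level is a choice made here, not taken from the chapter), a sketch that fits the least-squares line, predicts Y at a new X, and tests whether the true slope is zero:

import numpy as np
from scipy import stats

# Invented illustrative data (not the Behemoth.com data)
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Least-squares line: slope, intercept, and the t-test of "true slope = 0"
fit = stats.linregress(x, y)

# Predicted (best-estimate) Y at a new X value, from slope and intercept
x_new = 3.5
y_pred = fit.intercept + fit.slope * x_new

# 95% confidence interval for the slope, with n - 2 degrees of freedom
n = len(x)
t_crit = stats.t.ppf(0.975, df=n - 2)
ci_low = fit.slope - t_crit * fit.stderr
ci_high = fit.slope + t_crit * fit.stderr

print(f"Y-hat = {fit.intercept:.3f} + {fit.slope:.3f} * X")
print(f"predicted Y at X = {x_new}: {y_pred:.3f}")
print(f"slope p-value = {fit.pvalue:.4f}, 95% CI = ({ci_low:.3f}, {ci_high:.3f})")

A small p-value for the slope plays the role of concluding “beyond a reasonable doubt” that a linear relationship truly exists.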
9.8 ADDENDUM: A QUICK DISCUSSION OF SOME ASSUMPTIONS IMPLICIT IN INTERPRETING THE RESULTS
When we perform “statistical inference” (more or less, for us, confidence intervals and hypothesis testing) in a correlation or regression analysis, there are three theoretical assumptions we are technically making.
One assumption, called “normality,” says that if we hold X constant at any (and
every) value, and were to look at many values of Y at that X value, the Y values
would form a normal distribution.
A second assumption, called “constant variability” (or often by the ugly word “homoscedasticity,” which is said to mean “constant variability” in Greek [and sometimes it is spelled with the first “c” as a “k”]), says that the normal curves for each X have the same variability (which, as we might recall from Chapter 1, we can measure by the standard deviation).
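The chapter relies on Excel and SPSS and does not show how to check these assumptions; purely as an informal sketch (the residual-based checks and the data below are illustrative assumptions, not the chapter's method), one might examine the regression residuals in Python:

import numpy as np
from scipy import stats

# Invented illustrative data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.2, 13.9, 16.1])

fit = stats.linregress(x, y)
residuals = y - (fit.intercept + fit.slope * x)

# Normality: Shapiro-Wilk test on the residuals (a large p-value gives
# no evidence against the normality assumption)
w_stat, p_norm = stats.shapiro(residuals)
print(f"Shapiro-Wilk p-value: {p_norm:.3f}")

# Constant variability: compare residual spread at low versus high X
half = len(x) // 2
print(f"SD of residuals, low X:  {residuals[:half].std(ddof=1):.3f}")
print(f"SD of residuals, high X: {residuals[half:].std(ddof=1):.3f}")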
 