Database Reference
In-Depth Information
Discussion
When the data values for two variables X and Y are stored in a database, the least-squares
regression for them can be calculated easily using aggregate functions. The same is true
for the correlation coefficient. The two calculations are actually fairly similar, and many
terms for performing the computations are common to the two procedures.
Suppose that you want to calculate a least-squares regression using the age and test score
values for the observations in the testscore table:
mysql> SELECT age, score FROM testscore;
+-----+-------+
| age | score |
+-----+-------+
| 5 | 5 |
| 5 | 4 |
| 5 | 6 |
| 5 | 7 |
| 6 | 8 |
| 6 | 9 |
| 6 | 4 |
| 6 | 6 |
| 7 | 8 |
| 7 | 6 |
| 7 | 9 |
| 7 | 7 |
| 8 | 9 |
| 8 | 6 |
| 8 | 7 |
| 8 | 10 |
| 9 | 9 |
| 9 | 7 |
| 9 | 10 |
| 9 | 9 |
+-----+-------+
The following equation expresses the regression line, where a and b are the intercept
and slope of the line:
Y = bX + a
Letting age be X and score be Y , begin by computing the terms needed for the regression
equation. These include the number of observations; the means, sums, and sums of
squares for each variable; and the sum of the products of each variable: 2
mysql> SELECT
-> @n := COUNT(score) AS N,
-> @meanX := AVG(age) AS 'X mean',
-> @sumX := SUM(age) AS 'X sum',
2. To see where these terms come from, consult any standard statistics text.
Search WWH ::




Custom Search