models, built using the same input attributes and values, the model with the
higher R² would always be preferred.
When comparing models built using the same input observations but where
the input attributes of one model are a subset of the input attributes of the other,
an adjusted R² is frequently computed. The adjustment to R² will factor a
penalty into the model built using the greater number of input attributes. Its
purpose is to overcome the tendency for R² to increase due to chance with the
addition of new attributes even when those attributes do not make any
contribution with respect to the total population.
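The penalty can be illustrated with a short sketch. The text does not say which adjustment it uses, so the formula below is an assumption: the commonly used form 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of observations and p the number of input attributes. The function name and the sample values are hypothetical.

```python
def adjusted_r_squared(r_squared, n_observations, n_attributes):
    """Adjusted R^2 in its common form (assumed here, not taken from the text):
    1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    degrees_of_freedom = n_observations - n_attributes - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more observations than attributes plus one")
    return 1.0 - (1.0 - r_squared) * (n_observations - 1) / degrees_of_freedom


# Hypothetical example: two models built on the same 30 observations.
# The larger model adds three attributes and gains only a little raw R^2.
small_model = adjusted_r_squared(0.750, n_observations=30, n_attributes=3)
large_model = adjusted_r_squared(0.755, n_observations=30, n_attributes=6)

print(f"3 attributes: adjusted R^2 = {small_model:.4f}")
print(f"6 attributes: adjusted R^2 = {large_model:.4f}")
```

With these illustrative numbers the larger model has the higher raw R² but the lower adjusted R², which is the situation the adjustment is meant to expose.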
Model Validity
The R² statistic of a regression model is an indicator of how well the model
performs. That is, how well do the input values explain the variations in the output
values? In regression analysis, a related question is, how valid are the results?
Do the non-zero valued coefficients generated actually represent relationships
between the input and output columns, or is it possible that the coefficients are
chance occurrences resulting from the incompleteness of the dataset?
Looking Beyond R²
Consider the four datasets in Figure 6.1, which were created by Anscombe in
1973 to illustrate possible pitfalls in linear regression. [Anscombe, F. J. “Graphs
in Statistical Analysis”. American Statistician 27 (1): 17-21]
When a simple linear regression modeler is applied to each of the four
datasets using y as the output column, the resulting models are nearly identical.
      (a)             (b)             (c)             (d)
   x      y        x      y        x      y        x      y
  10    8.04      10    9.14      10    7.46       8    6.58
   8    6.95       8    8.14       8    6.77       8    5.76
  13    7.58      13    8.74      13   12.74       8    7.71
   9    8.81       9    8.77       9    7.11       8    8.84
  11    8.33      11    9.26      11    7.81       8    8.47
  14    9.96      14    8.10      14    8.84       8    7.04
   6    7.24       6    6.13       6    6.08       8    5.25
   4    4.26       4    3.10       4    5.39      19   12.5
  12   10.84      12    9.13      12    8.15       8    5.56
   7    4.82       7    7.26       7    6.42       8    7.91
   5    5.68       5    4.74       5    5.73       8    6.89

Figure 6.1 Regression Datasets
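The claim that the four models come out nearly identical can be checked directly. The following is a minimal sketch that fits an ordinary least-squares line to each dataset in Figure 6.1 using plain Python. The helper name fit_simple_linear is hypothetical, and the code is an illustration, not the modeler the text refers to.

```python
def fit_simple_linear(xs, ys):
    """Least-squares fit of y = intercept + slope * x.
    Returns (slope, intercept, r_squared)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R^2 = 1 - (residual sum of squares) / (total sum of squares)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1.0 - ss_res / ss_tot


# Values transcribed from Figure 6.1; datasets (a)-(c) share the same x column.
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
datasets = {
    "a": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "b": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "c": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "d": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.5, 5.56, 7.91, 6.89]),
}

results = {name: fit_simple_linear(xs, ys) for name, (xs, ys) in datasets.items()}
for name, (slope, intercept, r2) in results.items():
    print(f"({name}) y = {intercept:.2f} + {slope:.3f}x, R^2 = {r2:.3f}")
```

Each fit lands close to y = 3 + 0.5x with an R² of roughly 0.67, even though the four columns of y values differ considerably from one dataset to the next.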
 