Fig. 7.7 Comparison between the predictive power of two regression models: a 13-degree polynomial $Y = c_0 + c_1 X + c_2 X^2 + \dots + c_{13} X^{13}$ (depicted by the continuous line) and a least-squares line (depicted by the dotted line). The dataset used to calculate the models consists of the 14 points depicted as circles, while the last point, represented by a star, is the value of $Y$ we predict with our models. The 13-degree polynomial is a perfect example of a model which overfits the data. Namely, it completely fails to predict $Y$ in the 15th data point.
perfectly $Y$ in the $n$ data points, but it will give a very poor prediction of $Y$ in other points (see Fig. 7.7 for an example of overfitting).
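The overfitting behaviour described above can be reproduced with a small sketch. The data here are hypothetical (14 points on an assumed line $Y = 2X + 1$ with hand-picked perturbations), not the data behind Fig. 7.7, and `numpy.polynomial.Polynomial.fit` is only one possible way to obtain the two fits.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical training data: 14 points near the line Y = 2X + 1.
# The perturbations are hand-picked for illustration only.
x_train = np.arange(1.0, 15.0)
noise = np.array([0.3, -0.5, 0.2, 0.4, -0.1, -0.6, 0.5,
                  0.1, -0.3, 0.6, -0.2, -0.4, 0.3, -0.1])
y_train = 2.0 * x_train + 1.0 + noise

# The 15th point is held out and used only to check the predictions.
x_new = 15.0
y_new = 2.0 * x_new + 1.0

# Least-squares line and 13-degree polynomial fitted to the same 14 points.
line = Polynomial.fit(x_train, y_train, deg=1)
poly13 = Polynomial.fit(x_train, y_train, deg=13)

# Largest absolute error on the points used for fitting.
train_err_line = np.max(np.abs(line(x_train) - y_train))
train_err_poly = np.max(np.abs(poly13(x_train) - y_train))

# Absolute error on the held-out 15th point.
pred_err_line = abs(line(x_new) - y_new)
pred_err_poly = abs(poly13(x_new) - y_new)

print(f"training error:   line={train_err_line:.3g}, poly13={train_err_poly:.3g}")
print(f"prediction error: line={pred_err_line:.3g}, poly13={pred_err_poly:.3g}")
```

With 14 coefficients and 14 points, the polynomial passes through every training point, so its training error is essentially zero, yet its error on the held-out point is much larger than that of the line.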
The mean square error is a measure of the size of the regression error, but it does not give any indication about the explained component of the regression. The multiple coefficient of determination provides a measure of the capacity of a regression model to explain the dependent variable ($SSR$, $SST$, $SSE$ as in Table 7.6):

$$R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \tag{7.21}$$
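As a minimal sketch of Eq. (7.21), the snippet below computes the three sums of squares for a least-squares line and checks that both forms of $R^2$ agree. It assumes the usual definitions (total, regression, and error sums of squares about the sample mean), since Table 7.6 is not reproduced here; the data and the helper name `sums_of_squares` are illustrative.

```python
import numpy as np

def sums_of_squares(y, y_hat):
    """Return (SST, SSR, SSE) for observations y and fitted values y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)
    y_mean = y.mean()
    sst = np.sum((y - y_mean) ** 2)      # total variation of Y
    ssr = np.sum((y_hat - y_mean) ** 2)  # variation explained by the model
    sse = np.sum((y - y_hat) ** 2)       # residual (unexplained) variation
    return sst, ssr, sse

# Hypothetical data, roughly linear in x.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1, 11.9])

# Least-squares line with intercept; np.polyfit returns highest degree first.
slope, intercept = np.polyfit(x, y, 1)
y_hat = intercept + slope * x

sst, ssr, sse = sums_of_squares(y, y_hat)
r2_from_ssr = ssr / sst
r2_from_sse = 1.0 - sse / sst

print(f"SST={sst:.4f}  SSR={ssr:.4f}  SSE={sse:.4f}")
print(f"R^2 = SSR/SST = {r2_from_ssr:.6f},  1 - SSE/SST = {r2_from_sse:.6f}")
```

The two expressions coincide because, for a least-squares fit with an intercept, $SST = SSR + SSE$.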
The value $R^2$ measures the part of the variation of the dependent variable which is explained by the combination of the independent variables in the multiple regression model. In other words, $R^2$ measures the fraction of the variation of $Y$ explained by the regression model in terms of the $X_i$ variables (even if it is not relevant here, the notation $R^2$ is due to the fact that it corresponds to the square of another statistical index). However, $R^2$ also has a disadvantage: it does not take into account the number of independent variables used in the regression model. For this reason, the adjusted multiple coefficient of determination is introduced ($MST = SST/(n-1)$):

$$\bar{R}^2 = 1 - \frac{SSE/[n-(k+1)]}{SST/(n-1)} = 1 - \frac{MSE}{MST}, \tag{7.22}$$
which accounts for the degrees of freedom of SSE and SST, thereby penalizing models that fit the data well but are not parsimonious because they use too many independent variables.
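The penalty in Eq. (7.22) can be illustrated with a short sketch. It assumes that $k$ denotes the number of independent variables and $n$ the number of observations, as the term $n-(k+1)$ suggests; the function name `adjusted_r_squared` and the numbers are hypothetical.

```python
def adjusted_r_squared(sse, sst, n, k):
    """Adjusted coefficient of determination, 1 - MSE/MST.

    sse, sst: error and total sums of squares
    n: number of observations
    k: number of independent variables (assumed meaning of k)
    """
    dof_error = n - (k + 1)
    if dof_error <= 0:
        raise ValueError("need n > k + 1 to estimate MSE")
    mse = sse / dof_error
    mst = sst / (n - 1)
    return 1.0 - mse / mst

# Hypothetical example: two models with the same fit quality (same SSE, SST)
# on n = 25 observations, using 1 and 4 independent variables respectively.
sse, sst, n = 20.0, 100.0, 25
r2 = 1.0 - sse / sst
adj_k1 = adjusted_r_squared(sse, sst, n, k=1)
adj_k4 = adjusted_r_squared(sse, sst, n, k=4)

print(f"R^2 = {r2:.4f}")
print(f"adjusted R^2 with k=1: {adj_k1:.4f}")
print(f"adjusted R^2 with k=4: {adj_k4:.4f}")
```

Both models have the same $R^2$, but the one using more independent variables receives the lower adjusted value. The guard on `dof_error` reflects the fact that the formula is undefined when $n = k + 1$, which is the situation of a 13-degree polynomial fitted to 14 points.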
When we define a new regression model, we need to avoid using independent variables that are not related to the dependent variable $Y$. In other words, when we try to define a regression model, we have to answer this basic question:
is