Information Technology Reference
In-Depth Information
Fig. 7.7 Comparison between the predictive power of two regression models: a 13-degree
polynomial $Y = c_0 + c_1 X + c_2 X^2 + \dots + c_{13} X^{13}$ (depicted by the continuous line)
and a least-squares line (depicted by the dotted line). The dataset used to calculate the models is the 14
points depicted as circles, while the last point, represented by a star, is the value of Y we predict
with our models. The 13-degree polynomial is a perfect example of a model that overfits the
data: it fails completely to predict Y at the 15th data point.
perfectly Y in the n data points, but it will give a very poor prediction of Y in other
points (see Fig. 7.7 for an example of overfitting).
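The overfitting effect described here can be reproduced with a small numerical experiment. The following is only a sketch under stated assumptions: the data are made up (a straight line plus a small alternating perturbation, not the data of Fig. 7.7), and NumPy's `Polynomial.fit` is used as one convenient way to compute both least-squares fits.

```python
import numpy as np
from numpy.polynomial import Polynomial

# Hypothetical data (not the data of Fig. 7.7): 15 points on y = 2x + 1,
# each shifted by an alternating perturbation of +1 or -1.
x_all = np.arange(1, 16, dtype=float)
noise = np.where(np.arange(15) % 2 == 0, 1.0, -1.0)
y_all = 2.0 * x_all + 1.0 + noise

# The first 14 points are used to fit the models, the 15th is held out.
x_train, y_train = x_all[:14], y_all[:14]
x_new, y_new = x_all[14], y_all[14]

# Least-squares line (degree 1) and 13-degree polynomial.
# With 14 points, the 13-degree polynomial passes through every training point.
line = Polynomial.fit(x_train, y_train, deg=1)
poly13 = Polynomial.fit(x_train, y_train, deg=13)

# Largest absolute error on the 14 points used for fitting.
line_train_err = np.max(np.abs(line(x_train) - y_train))
poly_train_err = np.max(np.abs(poly13(x_train) - y_train))

# Absolute error on the held-out 15th point.
line_new_err = abs(line(x_new) - y_new)
poly_new_err = abs(poly13(x_new) - y_new)

print(f"max training error: line={line_train_err:.3g}, poly13={poly_train_err:.3g}")
print(f"error at 15th point: line={line_new_err:.3g}, poly13={poly_new_err:.3g}")
```

On this artificial dataset the polynomial reproduces the 14 training values almost exactly, yet its error at the 15th point is several orders of magnitude larger than that of the straight line.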
The mean square error is a measure of the size of regression error, but does not
give any indication about the explained component of the regression. The multiple
coefficient of determination measures the capacity of a regression model
to explain the dependent variable (SSR, SST, and SSE as in Table 7.6):
\[ R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}. \qquad (7.21) \]
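As an illustration of Eq. (7.21), the sketch below computes the three sums of squares and the coefficient of determination for an ordinary least-squares fit with an intercept. It assumes the usual definitions (SSR as the regression sum of squares, SSE as the error sum of squares, SST as the total sum of squares); the function names `regression_sums` and `r_squared` and the data are hypothetical and chosen only for this example.

```python
import numpy as np

def regression_sums(X, y):
    """Fit least squares with an intercept and return (SSR, SSE, SST).

    Assumed definitions: SSR = sum (y_hat - mean(y))^2,
    SSE = sum (y - y_hat)^2, SST = sum (y - mean(y))^2.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    A = np.column_stack([np.ones(len(y)), X])      # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)   # least-squares coefficients
    y_hat = A @ beta                               # fitted values
    ssr = np.sum((y_hat - y.mean()) ** 2)
    sse = np.sum((y - y_hat) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    return ssr, sse, sst

def r_squared(X, y):
    """Coefficient of determination, R^2 = 1 - SSE/SST."""
    _, sse, sst = regression_sums(X, y)
    return 1.0 - sse / sst

# Made-up data with two independent variables.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([3.1, 3.9, 7.2, 7.8, 11.1, 11.9])

ssr, sse, sst = regression_sums(X, y)
r2 = r_squared(X, y)
print(f"SSR={ssr:.4f}  SSE={sse:.4f}  SST={sst:.4f}  R^2={r2:.4f}")
```

Because the model includes an intercept, SSR + SSE equals SST up to rounding, so the two expressions for $R^2$ in Eq. (7.21) agree.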
The value $R^2$ measures the part of the variation of the dependent variable that is
explained by the combination of the independent variables in the multiple regression
model. In other words, $R^2$ measures the percentage of the variation of Y explained by the regression
model in terms of the $X_i$ variables (although it is not relevant here, the notation $R^2$ is due
to the fact that it corresponds to the square of another statistical index). However,
$R^2$ also has a disadvantage: it does not take into account the number of
independent variables used in the regression model. For this reason, the adjusted
multiple coefficient of determination is introduced ($MST = SST/(n-1)$):
\[ \bar{R}^2 = 1 - \frac{SSE/[n-(k+1)]}{SST/(n-1)} = 1 - \frac{MSE}{MST}, \qquad (7.22) \]
which accounts for the degrees of freedom of SSE and SST, thereby penalizing
those models that fit the data well but are not parsimonious because they use
too many independent variables.
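The penalty for non-parsimonious models can be seen in a short experiment. The sketch below is illustrative only: it assumes that n is the number of observations and k the number of independent variables, the helper name `r2_and_adjusted` is hypothetical, and the data are synthetic, with five extra regressors that are unrelated to Y by construction.

```python
import numpy as np

def r2_and_adjusted(X, y):
    """Return (R^2, adjusted R^2) for a least-squares fit with an intercept.

    Assumes n observations and k independent variables, so that
    MSE = SSE / (n - (k + 1)) and MST = SST / (n - 1).
    """
    y = np.asarray(y, dtype=float)
    X = np.asarray(X, dtype=float).reshape(len(y), -1)  # accept 1-D or 2-D input
    n, k = X.shape
    A = np.column_stack([np.ones(n), X])                # prepend the intercept column
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    sse = np.sum((y - A @ beta) ** 2)
    sst = np.sum((y - y.mean()) ** 2)
    mse = sse / (n - (k + 1))
    mst = sst / (n - 1)
    return 1.0 - sse / sst, 1.0 - mse / mst

# Synthetic data: Y depends on x1 only; "junk" holds 5 unrelated regressors.
rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
y = 1.0 + 2.0 * x1 + rng.normal(scale=0.5, size=n)
junk = rng.normal(size=(n, 5))

r2_small, adj_small = r2_and_adjusted(x1, y)
r2_big, adj_big = r2_and_adjusted(np.column_stack([x1, junk]), y)

print(f"1 variable : R^2={r2_small:.4f}  adjusted={adj_small:.4f}")
print(f"6 variables: R^2={r2_big:.4f}  adjusted={adj_big:.4f}")
```

Adding regressors can never lower $R^2$ for nested least-squares models, whereas the adjusted value is always at most $R^2$ and may fall when the extra variables explain little.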
When we define a new regression model, we need to avoid the use of independent
variables that are not related to the dependent variable Y. In other words, when
we try to define a regression model, we have to answer this basic question: is
 