“scale factor” in the specific estimation of the confidence interval of c_i). In Fig. 7.9 the t-distribution t(14), with 14 degrees of freedom, is depicted with significance level α = 0.05.
7.7.2 The Classical Stepwise Regression
In the previous section we defined the main concepts of multiple regression and indicated some statistical tests used to assess the correctness of a given regression model. In this section, we address the problem of variable selection, that is, the problem of deciding which independent variables should enter a multiple regression model, among a given set of candidate variables. The simplest method we can define consists of running all possible regressions, for all possible choices of independent variables, and then choosing the best model as the one with the highest R² or the lowest MSE. This brute-force method has the problem that the number of models it considers increases exponentially with the number of candidate variables. In fact, the number of different models that we can define by means of k independent variables is 2^k.
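The brute-force approach can be sketched as follows. This is a minimal illustration, not a reference implementation: the data, helper names, and the use of degrees-of-freedom-adjusted MSE as the selection criterion are assumptions (plain R² would always favor the full model, since it never decreases when a variable is added).

```python
# Brute-force variable selection (hypothetical illustration): fit an
# ordinary least-squares model for every non-empty subset of the k
# candidate variables and keep the one with the lowest
# MSE = SSE / (n - p - 1), where p is the number of regressors.
from itertools import combinations
import numpy as np

def ols_sse(X, y):
    """Sum of squared errors of the least-squares fit of y on [1, X]."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def best_subset(X, y):
    """Try all 2^k - 1 non-empty subsets; return (best columns, its MSE)."""
    n, k = X.shape
    best_cols, best_mse = None, np.inf
    for p in range(1, k + 1):
        for cols in combinations(range(k), p):
            mse = ols_sse(X[:, cols], y) / (n - p - 1)
            if mse < best_mse:
                best_cols, best_mse = cols, mse
    return best_cols, best_mse

# Synthetic data: only columns 0 and 2 actually influence y.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))          # k = 4 -> 2^4 - 1 = 15 models
y = 2.0 * X[:, 0] - 1.0 * X[:, 2] + rng.normal(scale=0.1, size=50)
cols, mse = best_subset(X, y)
print(cols, mse)
```

With a strong signal-to-noise ratio as above, the selected subset should contain the two truly relevant columns; already at k = 20 this method would require fitting over a million models.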
The stepwise regression algorithm provides a method for variable selection which allows us to obtain good regression models with lower time complexity. This algorithm does not necessarily find the best model among all 2^k possible models, but it allows us to find a good model in a feasible time even when the number of independent variables is high. The method uses a statistical test, again based on the F-distribution, called the partial F-test, as it evaluates the relative significance of a subset of all possible variables.
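The forward variant of this idea can be sketched as follows. This is a hypothetical illustration, not the textbook's algorithm verbatim: the data, function names, and the entry threshold f_enter are assumptions, and the full classical procedure also re-tests already-included variables for removal at each step.

```python
# Forward stepwise selection (sketch): starting from the intercept-only
# model, repeatedly add the candidate variable with the largest partial
# F statistic, stopping when no candidate exceeds the threshold f_enter.
import numpy as np

def sse(cols, X, y):
    """SSE of the least-squares fit of y on an intercept plus X[:, cols]."""
    A = np.column_stack([np.ones(len(y))] + [X[:, c] for c in cols])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def forward_stepwise(X, y, f_enter=4.0):
    n, k = X.shape
    selected = []
    while len(selected) < k:
        sse_red = sse(selected, X, y)
        scores = []
        for c in range(k):
            if c in selected:
                continue
            sse_full = sse(selected + [c], X, y)
            df = n - (len(selected) + 1) - 1   # residual df of larger model
            scores.append(((sse_red - sse_full) / (sse_full / df), c))
        f_best, c_best = max(scores)
        if f_best < f_enter:
            break                              # no candidate is significant
        selected.append(c_best)
    return selected

# Synthetic data: only columns 1 and 3 actually influence y.
rng = np.random.default_rng(2)
X = rng.normal(size=(60, 5))
y = 1.5 * X[:, 1] - 2.0 * X[:, 3] + rng.normal(scale=0.5, size=60)
sel = forward_stepwise(X, y)
print(sel)   # the two truly relevant columns should be picked first
```

Each pass fits at most k models, so the work grows roughly quadratically with k rather than exponentially, at the price of possibly missing the globally best subset.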
Suppose that a regression model of Y with k independent variables is postulated:

Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_k X_k + ε.    (7.25)
We will call this model the full model in the sense that it includes the maximal set
of independent variables. Now, suppose that we want to test the relative significance
of a subset of r of the k independent variables in the full model. The partial F -test
provides a statistical criterion for evaluating if the full model given in Eq. (7.25) is
better than the reduced model with only k − r variables:

Y = β_0 + β_1 X_1 + β_2 X_2 + ... + β_{k−r} X_{k−r} + ε.    (7.26)
This corresponds to comparing the two hypotheses given in Table 7.8.
Table 7.8 Hypotheses of the partial F-test

H_0: β_{k−r+1} = β_{k−r+2} = ... = β_k = 0
H_1: β_{k−r+1}, β_{k−r+2}, ..., β_k are not all zero.
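The partial F statistic behind this comparison can be computed directly from the residual sums of squares of the two nested models, F = ((SSE_red − SSE_full)/r) / (SSE_full/(n − k − 1)). The sketch below is a hypothetical illustration (the data and helper names are assumptions); the resulting statistic would be compared against the critical value of the F(r, n − k − 1) distribution.

```python
# Partial F statistic for comparing a full model with k variables
# against a nested reduced model with k - r of them.
import numpy as np

def sse(X, y):
    """SSE of the least-squares fit of y on an intercept plus X."""
    A = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return float(resid @ resid)

def partial_f(X_full, X_reduced, y):
    n, k = X_full.shape
    r = k - X_reduced.shape[1]       # number of variables being tested
    sse_full, sse_red = sse(X_full, y), sse(X_reduced, y)
    return ((sse_red - sse_full) / r) / (sse_full / (n - k - 1))

# Synthetic data: only X_1 influences y.
rng = np.random.default_rng(1)
X = rng.normal(size=(40, 3))
y = 3.0 * X[:, 0] + rng.normal(size=40)
F_null = partial_f(X, X[:, :1], y)   # H0: beta_2 = beta_3 = 0 (true here)
F_sig = partial_f(X, X[:, 1:], y)    # H0: beta_1 = 0 (false here)
print(F_null, F_sig)                 # compare with F(r, n - k - 1) quantile
```

Since the reduced model is nested in the full one, SSE_red ≥ SSE_full always holds and the statistic is non-negative; dropping a truly relevant variable inflates it sharply, while dropping irrelevant ones keeps it near its null expectation.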
 