the explanatory nature of the model is known as overfitting. To address the possibility of overfitting the data, the adjusted R² accounts for the number of parameters included in the linear regression model.
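The adjustment can be stated concretely: adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1), where n is the number of observations and p is the number of predictors. A minimal Python sketch (with made-up values, not output from the Income example) illustrates how the penalty grows with p:

```python
def adjusted_r2(r2, n, p):
    """Adjusted R-squared: penalizes plain R-squared for the number
    of predictors p, given n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

# Hypothetical fit with R^2 = 0.90 on n = 100 observations:
print(adjusted_r2(0.90, n=100, p=2))   # slightly below 0.90
print(adjusted_r2(0.90, n=100, p=20))  # larger penalty for more parameters
```

A model stuffed with parameters must improve R² enough to overcome this penalty, which is why the adjusted value is the better guide when comparing models of different sizes.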
The F-statistic provides a method for testing the entire regression model. In the previous t-tests, individual tests were conducted to determine the statistical significance of each parameter. The provided F-statistic and corresponding p-value enable the analyst to test the following hypotheses:

H0: β1 = β2 = … = βp−1 = 0
HA: βj ≠ 0 for at least one value of j
In this example, the p-value of 2.2e-16 is small, which indicates that the null hypothesis should be rejected.
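When only R² is reported, the overall F-statistic can be recovered from it as F = (R²/p) / ((1 − R²)/(n − p − 1)). A short Python sketch with made-up numbers (not the textbook's output) shows the computation:

```python
def f_statistic(r2, n, p):
    """Overall F-statistic for a regression with p predictors fit to
    n observations, computed from R-squared. Large values argue
    against H0 (all slope coefficients equal to zero)."""
    return (r2 / p) / ((1 - r2) / (n - p - 1))

# Hypothetical model: R^2 = 0.6, n = 50 observations, p = 2 predictors
print(f_statistic(0.6, n=50, p=2))  # compared against an F(p, n-p-1) distribution
```

The statistic is referred to an F distribution with p numerator and n − p − 1 denominator degrees of freedom to obtain the p-value reported by regression software.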
Categorical Variables
In the previous example, the variable Gender was a simple binary variable that indicated whether a person is female or male. In general, these variables are known as categorical variables.
To illustrate how to use categorical variables properly, suppose it was decided in the earlier Income example to include an additional variable, State, to represent the U.S. state where the person resides. Similar to the use of the Gender variable, one possible, but incorrect, approach would be to include a State variable that would take a value of 0 for Alabama, 1 for Alaska, 2 for Arizona, and so on. The problem with this approach is that such a numeric assignment based on an alphabetical ordering of the states does not provide a meaningful measure of the difference between the states. For example, is it useful or proper to consider Arizona to be one unit greater than Alaska and two units greater than Alabama?
In regression, a proper way to implement a categorical variable that can take on m different values is to add m−1 binary variables to the regression model. To illustrate with the Income example, a binary variable for each of 49 states, excluding Wyoming (arbitrarily chosen as the last of 50 states in an alphabetically sorted list), could be added to the model.
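The m−1 dummy-variable scheme can be sketched in Python; the data here are hypothetical and the helper function is illustrative, not part of the textbook's R session. Each observation maps to m−1 indicators, with the omitted last level serving as the reference category:

```python
def dummy_encode(values, levels):
    """Encode a categorical variable with m levels as m-1 binary
    (dummy) indicators, dropping the last level as the reference."""
    return [[1 if v == lvl else 0 for lvl in levels[:-1]] for v in values]

# Hypothetical m = 4 levels; "Wyoming" is the dropped reference level
states = ["Alabama", "Alaska", "Arizona", "Wyoming"]
rows = dummy_encode(["Alaska", "Wyoming"], states)
print(rows)  # → [[0, 1, 0], [0, 0, 0]]
```

An observation from the reference state maps to all zeros, so its effect is absorbed into the intercept; each dummy coefficient is then interpreted relative to that reference.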
results3 <- lm(Income ~ Age + Education +
    Alabama +
    Alaska +
    Arizona +