6.3 Reasons to Choose and Cautions
Linear regression is suitable when the input variables are continuous or discrete
(including categorical), provided the outcome variable is continuous. If the
outcome variable is categorical, logistic regression is a better choice.
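As a minimal sketch of the first case, the following fits an ordinary least squares model for a continuous outcome using synthetic data. The variable names (age, education, income) echo the chapter's running example but the data and coefficients here are invented for illustration.

```python
import numpy as np

# Synthetic data: income (continuous outcome) driven by age and education.
rng = np.random.default_rng(0)
n = 200
age = rng.uniform(20, 60, n)
educ = rng.integers(10, 20, n).astype(float)
income = 5_000 + 800 * age + 2_000 * educ + rng.normal(0, 3_000, n)

# Ordinary least squares: design matrix with an intercept column.
X = np.column_stack([np.ones(n), age, educ])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta)  # estimated [intercept, age, education] coefficients
```

Were the outcome binary instead (for example, defaulted or not), the same inputs would feed a logistic regression, which models the log-odds of the outcome rather than its value.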
Both models assume a linear additive function of the input variables. If such an
assumption does not hold true, both regression techniques perform poorly.
Furthermore, in linear regression, the assumption of normally distributed error
terms with constant variance is important for many of the statistical inferences
that can be made. If these assumptions do not appear to hold, the
appropriate transformations need to be applied to the data.
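One way to see such a violation, and a transformation that repairs it, is sketched below with synthetic data: multiplicative noise makes the residual variance grow with the input on the original scale, while a log transformation restores roughly constant variance. The data and the split point are illustrative assumptions, not from the text.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 500)
# Multiplicative errors: spread of y grows with x on the original scale.
y = 2.0 * x * np.exp(rng.normal(0.0, 0.3, 500))

def resid_spread_ratio(xv, yv):
    """Fit a simple linear model, then compare residual spread
    in the upper vs. lower half of the input range."""
    X = np.column_stack([np.ones_like(xv), xv])
    beta, *_ = np.linalg.lstsq(X, yv, rcond=None)
    resid = yv - X @ beta
    lo, hi = resid[xv < np.median(xv)], resid[xv >= np.median(xv)]
    return hi.std() / lo.std()

ratio_raw = resid_spread_ratio(x, y)                  # clearly > 1: heteroscedastic
ratio_log = resid_spread_ratio(np.log(x), np.log(y))  # near 1: constant variance
print(ratio_raw, ratio_log)
```

A ratio well above 1 on the raw scale, falling to about 1 after the log transform, is the numerical counterpart of the funnel shape one would see in a residual plot.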
Although a collection of input variables may be a good predictor for the outcome
variable, the analyst should not infer that the input variables directly cause an
outcome. For example, it may be identified that those individuals who have regular
dentist visits may have a reduced risk of heart attacks. However, simply sending
someone to the dentist almost certainly has no effect on that person's chance of
having a heart attack. It is possible that regular dentist visits may indicate a person's
overall health and dietary choices, which may have a more direct impact on a
person's health. This example illustrates the commonly known expression,
“Correlation does not imply causation.”
Use caution when applying an already fitted model to data that falls outside the
dataset used to train the model. The linear relationship in a regression model may
no longer hold at values outside the training dataset. For example, if income was
an input variable and the values of income ranged from $35,000 to $90,000,
applying the model to incomes well outside that range could result in inaccurate
estimates and predictions.
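The risk can be made concrete with a small synthetic example: a linear fit trained on the $35,000 to $90,000 range tracks a mildly nonlinear relationship well inside that range, but its error grows sharply when extrapolated far beyond it. The square-root relationship here is an invented stand-in, not the text's actual model.

```python
import numpy as np

rng = np.random.default_rng(2)
income = rng.uniform(35_000, 90_000, 300)   # training range from the text
y = np.sqrt(income) + rng.normal(0, 1.0, 300)  # true relationship is nonlinear

X = np.column_stack([np.ones_like(income), income])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

def predict(inc):
    return beta[0] + beta[1] * inc

# Inside the training range, the linear approximation is adequate...
err_in = abs(predict(60_000) - np.sqrt(60_000))
# ...but far outside it, the straight line diverges from the truth.
err_out = abs(predict(500_000) - np.sqrt(500_000))
print(err_in, err_out)
```

Nothing in the fitted coefficients warns of this; the danger is visible only by comparing the prediction against reality outside the range the model ever saw.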
The income regression example in Section 6.1.2 mentioned the possibility of using
categorical variables to represent the 50 U.S. states. In a linear regression model,
the state of residence would provide a simple additive term to the income model
but no other impact on the coefficients of the other input variables, such as Age
and Education . However, if state does influence the other variables' impact to
the income model, an alternative approach would be to build 50 separate linear
regression models: one model for each state. Such an approach is an example of the
options and decisions that the data scientist must be willing to consider.
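Both options can be sketched in a few lines. The example below uses three states instead of 50 for brevity, with invented data: first, state enters as additive dummy (one-hot) terms that shift the intercept only; then, as the alternative, a separate regression is fit per state, which lets the Age coefficient differ by state.

```python
import numpy as np

# Synthetic data; three states rather than 50 for brevity.
rng = np.random.default_rng(3)
n = 300
age = rng.uniform(20, 60, n)
state = rng.integers(0, 3, n)
state_shift = np.array([0.0, 5_000.0, -3_000.0])  # additive state effects
income = 20_000 + 1_000 * age + state_shift[state] + rng.normal(0, 2_000, n)

# Option 1: state as additive dummy variables (first state is the baseline).
dummies = np.eye(3)[state][:, 1:]
X = np.column_stack([np.ones(n), age, dummies])
beta, *_ = np.linalg.lstsq(X, income, rcond=None)
print(beta)  # [intercept, age, state-1 shift, state-2 shift]

# Option 2: one separate regression per state, so every coefficient
# (not just the intercept) may vary with state.
per_state = {}
for s in range(3):
    mask = state == s
    Xs = np.column_stack([np.ones(mask.sum()), age[mask]])
    per_state[s], *_ = np.linalg.lstsq(Xs, income[mask], rcond=None)
```

The additive model is more parsimonious and pools all the data; the per-state models are more flexible but each is estimated from a smaller sample, which is exactly the kind of trade-off the text describes.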
If several of the input variables are highly correlated with each other, the condition
is known as multicollinearity. Multicollinearity can often lead to coefficient