Linear regression modeling is all about determining how close a given observation is to an imaginary line representing the average, or center, of all the points in the data set. That imaginary line gives us the first part of the term "linear regression". The formula for calculating a prediction using linear regression is y = mx + b. You may recognize this from a former algebra class as the slope-intercept form of the equation of a line. In this formula, the variable y is the target, the label, the thing we want to predict. So in this chapter's example, y is the amount of Heating_Oil we expect each home to consume. But how will we predict y? We need to know what m, x, and b are. The variable x is the value of a given predictor attribute, or what is sometimes referred to as an independent variable. Insulation, for example, is a predictor of heating oil usage, so Insulation is a predictor attribute. The variable m is that attribute's coefficient, shown in the second column of Figure 8-7. The coefficient is the amount of weight the attribute is given in the formula. Insulation, with a coefficient of 3.323, is weighted more heavily than any other predictor attribute in this data set. Each observation will have its Insulation value multiplied by the Insulation coefficient to properly weight that attribute when calculating y (heating oil usage). The variable b is a constant that is added to all linear regression calculations. It is represented by the intercept, shown in Figure 8-7 as 134.511. So suppose we had a house with an insulation density of 5; our formula using these Insulation values would be y = (5 × 3.323) + 134.511 = 151.126.
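This single-attribute prediction can be checked with a quick calculation. The sketch below is ours (the function name is made up for illustration), but the coefficient 3.323 and the intercept 134.511 come from Figure 8-7:

```python
# Predict heating oil usage from insulation density alone,
# using y = m*x + b from the text.
def predict_heating_oil(insulation):
    m = 3.323    # Insulation coefficient from Figure 8-7
    b = 134.511  # intercept from Figure 8-7
    return m * insulation + b

print(round(predict_heating_oil(5), 3))  # -> 151.126
```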
But wait! We had more than one predictor attribute. We started out using a combination of five
attributes to try to predict heating oil usage. The formula described in the previous paragraph only
uses one. Furthermore, our LinearRegression result set tab pictured in Figure 8-7 only has four
predictor variables. What happened to Num_Occupants?
The answer to the latter question is that Num_Occupants was not a statistically significant predictor of heating oil usage in this data set, and therefore RapidMiner removed it as a predictor. In other words, when RapidMiner evaluated the amount of influence each attribute had on the heating oil usage of each home represented in the training data set, the number of occupants was so non-influential that its weight in the formula was set to zero. Why might this occur? Two older people living in a house may use the same amount of heating oil as a young family of five: the older couple might take longer showers and prefer to keep their house much warmer in the wintertime than the young family would. Because the variability in the number of occupants does not help to explain each home's heating oil usage very well, it was removed as a predictor in our model.
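The effect of a zero weight can be seen by extending the prediction formula to several attributes. In this sketch, only the Insulation coefficient (3.323) and the intercept (134.511) come from Figure 8-7; the remaining attribute names and coefficients are hypothetical placeholders, included only to show that a zeroed attribute contributes nothing to the prediction:

```python
# Multiple linear regression prediction: y = b + sum of (coefficient * value).
# Only Insulation's coefficient and the intercept match Figure 8-7; the
# other coefficients below are made-up placeholders for illustration.
INTERCEPT = 134.511
COEFFICIENTS = {
    "Insulation":    3.323,
    "Num_Occupants": 0.0,   # weight set to zero: removed as a predictor
    "Attr_A":        1.2,   # hypothetical
    "Attr_B":       -0.8,   # hypothetical
}

def predict(home):
    return INTERCEPT + sum(m * home[attr] for attr, m in COEFFICIENTS.items())

# Changing Num_Occupants leaves the prediction unchanged:
couple = {"Insulation": 5, "Num_Occupants": 2, "Attr_A": 10, "Attr_B": 3}
family = {"Insulation": 5, "Num_Occupants": 5, "Attr_A": 10, "Attr_B": 3}
print(predict(couple) == predict(family))  # -> True
```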