Building a Regression Model with Spark - Machine Learning with Spark

Transforming the target variable

Recall that many machine learning models, including linear models, make assumptions about the distribution of the input data as well as the target variable. In particular, linear regression assumes that the errors, and hence the target values around the model's predictions, are normally distributed.

In many real-world cases, the distributional assumptions of linear regression do not hold. In this case, for example, we know that the number of bike rentals can never be negative. This alone suggests that the assumption of normality might be problematic, since a normal distribution assigns some probability to negative values. To get a better picture of the target distribution, it is often helpful to plot a histogram of the target values.

In this section, if you are using IPython Notebook, enter the magic function %pylab inline to import pylab (that is, the numpy and matplotlib plotting functions) into the workspace. This will also render any figures and plots inline within the Notebook cell.

If you are using the standard IPython console, you can use %pylab to import the necessary functionality (your plots will appear in a separate window).

The following code plots the distribution of the target variable:

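# Extract the target (the last field of each record) and collect it to the driver
targets = records.map(lambda r: float(r[-1])).collect()
# density=True replaces normed=True, which was removed in Matplotlib 3.1
hist(targets, bins=40, color='lightblue', density=True)
# Enlarge the current figure so the histogram is easier to read
fig = matplotlib.pyplot.gcf()
fig.set_size_inches(16, 10)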
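The normalization flag passed to hist rescales the bar heights so that the total area of the bars equals 1, which makes the histogram comparable to a probability density. The following is a minimal plain-Python sketch of that rescaling, not the library's implementation. The function name density_histogram and the toy values are invented for illustration and are not the bike rental data. It assumes the input contains at least two distinct values.

```python
def density_histogram(values, bins):
    """Return (edges, heights) for equal-width bins whose bar areas sum to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for v in values:
        # The maximum value is placed in the last bin, as NumPy and Matplotlib do
        idx = min(int((v - lo) / width), bins - 1)
        counts[idx] += 1
    n = len(values)
    # Dividing each count by (n * width) turns counts into density heights
    heights = [c / (n * width) for c in counts]
    edges = [lo + i * width for i in range(bins + 1)]
    return edges, heights


# Toy stand-in for the collected targets (invented values)
toy_targets = [1.0, 2.0, 3.0, 5.0, 8.0, 13.0, 40.0, 120.0, 300.0, 977.0]
edges, heights = density_histogram(toy_targets, bins=4)
bin_width = edges[1] - edges[0]
total_area = sum(h * bin_width for h in heights)
```

With the real data, the same function could be called as density_histogram(targets, 40) to inspect the bar heights numerically instead of visually.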

Looking at the histogram, we can see that the distribution is highly skewed and certainly does not follow a normal distribution.
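The visual impression of skew can also be backed by a number. The sketch below computes the moment coefficient of skewness, g1 = m3 / m2^(3/2), where m2 and m3 are the second and third central moments. A value near 0 indicates a symmetric distribution, and a large positive value indicates a long right tail. The function name skewness and the two toy lists are illustrative assumptions, not part of the original code, and the function assumes the values are not all identical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = sum(values) / n
    # Second and third central moments
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5  # assumes m2 > 0


# Toy examples (invented values): one symmetric list, one with a long right tail
symmetric = [1.0, 2.0, 3.0, 4.0, 5.0]
right_skewed = [1.0, 1.0, 2.0, 2.0, 3.0, 50.0]

skew_symmetric = skewness(symmetric)
skew_right = skewness(right_skewed)
```

Applied to the collected values, skewness(targets) gives a single figure that can be compared before and after any transformation of the target.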
