• workingday: This is whether the day was a working day or not
• weathersit: This is a categorical variable that describes the weather at a particular time
• temp: This is the normalized temperature
• atemp: This is the normalized apparent temperature
• hum: This is the normalized humidity
• windspeed: This is the normalized wind speed
• cnt: This is the target variable, that is, the count of bike rentals for that hour
We will work with the hourly data contained in hour.csv. If you look at the first line of the dataset, you will see that it contains the column names as a header. You can inspect it by running the following command:
>head -1 hour.csv
This should output the following result:
instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
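Since we will refer to columns by position once the header is gone, it can be handy to map each column name to its index. A minimal plain-Python sketch (the header string below is copied from the output above):

```python
# Header line as printed by 'head -1 hour.csv'
header = ("instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,"
          "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt")

# Map each column name to its zero-based index
col_idx = {name: i for i, name in enumerate(header.split(","))}

# The target variable, cnt, is the last of the 17 columns
print(col_idx["cnt"])  # → 16
```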
Before we work with the data in Spark, we will again remove the header from the first line of the file using the same sed command that we used previously to create a new file called hour_noheader.csv:
>sed 1d hour.csv > hour_noheader.csv
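If sed is not available on your platform, the same header removal can be done in a few lines of plain Python. The sketch below operates on a small invented sample rather than the real hour.csv, purely for illustration:

```python
import os
import tempfile

# A tiny stand-in for hour.csv: a header line plus two data lines
sample = "instant,dteday,season\n1,2011-01-01,1\n2,2011-01-01,1\n"

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "hour.csv")
dst = os.path.join(tmpdir, "hour_noheader.csv")

with open(src, "w") as f:
    f.write(sample)

# Equivalent of 'sed 1d hour.csv > hour_noheader.csv': drop the first line
with open(src) as fin, open(dst, "w") as fout:
    fout.writelines(fin.readlines()[1:])

with open(dst) as f:
    print(f.read())  # the header line is gone; only data records remain
```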
Since we will be doing some plotting of our dataset later on, we will use the Python shell
for this chapter. This also serves to illustrate how to use MLlib's linear model and decision
tree functionality from PySpark.
Start up your PySpark shell from your Spark installation directory. If you want to use IPython, which we highly recommend, remember to include the IPYTHON=1 environment variable together with the pylab functionality:
>IPYTHON=1 IPYTHON_OPTS="--pylab" ./bin/pyspark
If you prefer to use IPython Notebook, you can start it with the following command:
>IPYTHON=1 IPYTHON_OPTS=notebook ./bin/pyspark
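Once the shell is up, each record will arrive as a comma-separated string that we split and convert to numeric types before feeding it to MLlib. As a preview, here is a pure-Python sketch of that per-record parsing; the record line is invented for illustration, and the field positions follow the header shown earlier:

```python
# One record in the hour.csv layout (17 fields; values invented)
line = "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16"

fields = line.split(",")

# The target variable, cnt, is the last field
target = float(fields[-1])

# A simple numeric feature set: temp, atemp, hum, windspeed (columns 10-13)
features = [float(x) for x in fields[10:14]]

print(target, features)
```
In the chapter we will apply exactly this kind of function to every record of the RDD via a map transformation.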