You can type all the code that follows for the remainder of this chapter directly into your
PySpark shell (or into IPython Notebook if you wish to use it).
Tip
Recall that we used the IPython shell in Chapter 3, Obtaining, Processing, and Preparing
Data with Spark. Take a look at that chapter and the code bundle for instructions to install
IPython.
We'll start as usual by loading the dataset and inspecting it:
path = "/PATH/hour_noheader.csv"
raw_data = sc.textFile(path)
num_data = raw_data.count()
records = raw_data.map(lambda x: x.split(","))
first = records.first()
print first
print num_data
You should see the following output:
[u'1', u'2011-01-01', u'1', u'0', u'1', u'0', u'0', u'6',
u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13',
u'16']
17379
So, we have 17,379 hourly records in our dataset. We have inspected the column names
already. We will ignore the record ID and raw date columns. We will also ignore the
casual and registered count target variables and focus on the overall count variable,
cnt (which is the sum of the other two counts). We are left with 12 variables. The
first eight are categorical, while the last four are normalized real-valued variables.
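To make the column layout concrete, here is a minimal sketch of how the relevant fields
could be sliced out of each parsed record. The index positions and the split_record
helper are our own illustration, assuming the standard hour.csv column ordering (record
ID and date first, then the eight categorical fields, the four normalized fields, and
finally the casual, registered, and cnt counts):
# Sketch only: the index positions assume the standard hour.csv column order
def split_record(fields):
    # fields[0] is the record ID and fields[1] the raw date (both ignored)
    categorical = fields[2:10]                       # eight categorical variables
    real_valued = [float(f) for f in fields[10:14]]  # four normalized variables
    target = float(fields[-1])                       # cnt, the overall count
    return categorical, real_valued, target

print split_record(first)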
To deal with the eight categorical variables, we will use the binary encoding approach
with which you should be quite familiar by now. The four real-valued variables will be
left as is.
We will first cache our dataset, since we will be reading from it many times:
records.cache()
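As an illustration of the binary (1-of-k) encoding approach mentioned above, one way to
build an index mapping for a single categorical column is to collect its distinct values
and assign each one an index. The get_mapping helper below is our own sketch rather than
a library function:
# Sketch only: map each distinct value in column idx to an index,
# which will later become the position of the 1 in the binary vector
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

# for example, the mapping for the third column (the season variable)
print get_mapping(records, 2)
Applying such a mapping to each of the eight categorical columns would then determine the
length of the binary-encoded portion of the feature vector.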