You can type all the code that follows for the remainder of this chapter directly into your
PySpark shell (or into IPython Notebook if you wish to use it).
Tip
Recall that we used the IPython shell in Chapter 3, Obtaining, Processing, and Preparing
Data with Spark. Take a look at that chapter and the code bundle for instructions to install
IPython.
We'll start as usual by loading the dataset and inspecting it:
path = "/PATH/hour_noheader.csv"
raw_data = sc.textFile(path)
num_data = raw_data.count()
records = raw_data.map(lambda x: x.split(","))
first = records.first()
print first
print num_data
You should see the following output:
[u'1', u'2011-01-01', u'1', u'0', u'1', u'0', u'0', u'6',
u'0', u'1', u'0.24', u'0.2879', u'0.81', u'0', u'3', u'13',
u'16']
17379
So, we have 17,379 hourly records in our dataset. We have inspected the column names
already. We will ignore the record ID and raw date columns. We will also ignore the
casual and registered count target variables and focus on the overall count variable,
cnt (which is the sum of the other two counts). We are left with 12 variables. The
first eight are categorical, while the last four are normalized real-valued variables.
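To make the column layout concrete, here is a minimal sketch of how the relevant fields
could be sliced out of each parsed record. The index positions and the split_record
helper are our own illustration, assuming the standard hour.csv column ordering (record
ID and date first, then the eight categorical fields, the four normalized fields, and
finally the casual, registered, and cnt counts):
# Sketch only: the index positions assume the standard hour.csv column order
def split_record(fields):
    # fields[0] is the record ID and fields[1] the raw date (both ignored)
    categorical = fields[2:10]                       # eight categorical variables
    real_valued = [float(f) for f in fields[10:14]]  # four normalized variables
    target = float(fields[-1])                       # cnt, the overall count
    return categorical, real_valued, target

print split_record(first)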
To deal with the eight categorical variables, we will use the binary encoding approach
with which you should be quite familiar by now. The four real-valued variables will be
left as is.
We will first cache our dataset, since we will be reading from it many times:
records.cache()
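As an illustration of the binary (1-of-k) encoding approach mentioned above, one way to
build an index mapping for a single categorical column is to collect its distinct values
and assign each one an index. The get_mapping helper below is our own sketch rather than
a library function:
# Sketch only: map each distinct value in column idx to an index,
# which will later become the position of the 1 in the binary vector
def get_mapping(rdd, idx):
    return rdd.map(lambda fields: fields[idx]).distinct().zipWithIndex().collectAsMap()

# for example, the mapping for the third column (the season variable)
print get_mapping(records, 2)
Applying such a mapping to each of the eight categorical columns would then determine the
length of the binary-encoded portion of the feature vector.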