• workingday: This is whether the day was a working day or not
• weathersit: This is a categorical variable that describes the weather at a particular time
• temp: This is the normalized temperature
• atemp: This is the normalized apparent temperature
• hum: This is the normalized humidity
• windspeed: This is the normalized wind speed
• cnt: This is the target variable, that is, the count of bike rentals for that hour
We will work with the hourly data contained in hour.csv. If you look at the first line of the dataset, you will see that it contains the column names as a header. You can inspect it by running the following command:
>head -1 hour.csv
This should output the following result:
instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
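Since we will refer to columns by position once the header is gone, it can be handy to map each column name to its index. A minimal plain-Python sketch (the header string below is copied from the output above):

```python
# Header line as printed by 'head -1 hour.csv'
header = ("instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,"
          "weathersit,temp,atemp,hum,windspeed,casual,registered,cnt")

# Map each column name to its zero-based index
col_idx = {name: i for i, name in enumerate(header.split(","))}

# The target variable, cnt, is the last of the 17 columns
print(col_idx["cnt"])  # → 16
```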
Before we work with the data in Spark, we will again remove the header from the first line of the file using the same sed command that we used previously to create a new file called hour_noheader.csv:
>sed 1d hour.csv > hour_noheader.csv
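If sed is not available on your platform, the same header removal can be done in a few lines of plain Python. The sketch below operates on a small invented sample rather than the real hour.csv, purely for illustration:

```python
import os
import tempfile

# A tiny stand-in for hour.csv: a header line plus two data lines
sample = "instant,dteday,season\n1,2011-01-01,1\n2,2011-01-01,1\n"

tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "hour.csv")
dst = os.path.join(tmpdir, "hour_noheader.csv")

with open(src, "w") as f:
    f.write(sample)

# Equivalent of 'sed 1d hour.csv > hour_noheader.csv': drop the first line
with open(src) as fin, open(dst, "w") as fout:
    fout.writelines(fin.readlines()[1:])

with open(dst) as f:
    print(f.read())  # the header line is gone; only data records remain
```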
Since we will be doing some plotting of our dataset later on, we will use the Python shell
for this chapter. This also serves to illustrate how to use MLlib's linear model and decision
tree functionality from PySpark.
Start up your PySpark shell from your Spark installation directory. If you want to use IPython, which we highly recommend, remember to include the IPYTHON=1 environment variable together with the pylab functionality:
>IPYTHON=1 IPYTHON_OPTS="--pylab" ./bin/pyspark
If you prefer to use IPython Notebook, you can start it with the following command:
>IPYTHON=1 IPYTHON_OPTS=notebook ./bin/pyspark
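Once the shell is up, each record will arrive as a comma-separated string that we split and convert to numeric types before feeding it to MLlib. As a preview, here is a pure-Python sketch of that per-record parsing; the record line is invented for illustration, and the field positions follow the header shown earlier:

```python
# One record in the hour.csv layout (17 fields; values invented)
line = "1,2011-01-01,1,0,1,0,0,6,0,1,0.24,0.2879,0.81,0.0,3,13,16"

fields = line.split(",")

# The target variable, cnt, is the last field
target = float(fields[-1])

# A simple numeric feature set: temp, atemp, hum, windspeed (columns 10-13)
features = [float(x) for x in fields[10:14]]

print(target, features)
```
In the chapter we will apply exactly this kind of function to every record of the RDD via a map transformation.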