workingday : This indicates whether or not the day was a working day
weathersit : This is a categorical variable that describes the weather at a particular time
temp : This is the normalized temperature
atemp : This is the normalized apparent temperature
hum : This is the normalized humidity
windspeed : This is the normalized wind speed
cnt : This is the target variable, that is, the count of bike rentals for that hour
We will work with the hourly data contained in hour.csv . The first line of the dataset
contains the column names as a header; you can inspect it by running the following
command:
>head -1 hour.csv
This should output the following result:
instant,dteday,season,yr,mnth,hr,holiday,weekday,workingday,weathersit,temp,atemp,hum,windspeed,casual,registered,cnt
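Since later code will refer to columns by position, it can help to map each name in the header above to its zero-based index. The following snippet is just a small illustration (the `col_index` helper is not part of the book's code):

```python
# Header line as printed by `head -1 hour.csv`
header = ("instant,dteday,season,yr,mnth,hr,holiday,weekday,"
          "workingday,weathersit,temp,atemp,hum,windspeed,"
          "casual,registered,cnt")

# Map each column name to its zero-based index
col_index = {name: i for i, name in enumerate(header.split(","))}
print(col_index["cnt"])  # the target variable is the last of the 17 fields
```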
Before we work with the data in Spark, we will again remove the header from the first
line of the file using the same sed command we used previously, creating a new file
called hour_noheader.csv :
>sed 1d hour.csv > hour_noheader.csv
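If sed is not available (for example, on Windows), the same first-line deletion can be done in plain Python. This is a minimal sketch, not the book's own code, and the `drop_header` name is ours:

```python
import itertools

def drop_header(lines):
    """Yield every line except the first, mirroring `sed 1d`."""
    return itertools.islice(lines, 1, None)

# Usage (assumes hour.csv is in the current directory):
# with open("hour.csv") as src, open("hour_noheader.csv", "w") as dst:
#     dst.writelines(drop_header(src))
```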
Since we will be doing some plotting of our dataset later on, we will use the Python shell
for this chapter. This also serves to illustrate how to use MLlib's linear model and decision
tree functionality from PySpark.
Start up your PySpark shell from your Spark installation directory. If you want to use
IPython, which we highly recommend, remember to include the IPYTHON=1 environment
variable together with the pylab functionality:
>IPYTHON=1 IPYTHON_OPTS="--pylab" ./bin/pyspark
If you prefer to use IPython Notebook, you can start it with the following command:
>IPYTHON=1 IPYTHON_OPTS=notebook ./bin/pyspark
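Once the shell is running, a typical first step is to load the cleaned file and split each record on commas. In the sketch below, the Spark calls are left as comments because they assume a live shell where sc (the SparkContext) is already defined; the `parse_record` helper name is our own:

```python
def parse_record(line):
    """Split one CSV record from hour_noheader.csv into its fields."""
    return line.split(",")

# In the PySpark shell, where `sc` is created for you at startup:
# raw_data = sc.textFile("hour_noheader.csv")
# records = raw_data.map(parse_record)
# first = records.first()  # a list of 17 string fields
```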